Implementation of Sandhi Viccheda for Sanskrit Words/Sentences/Paragraphs
All content following this page was uploaded by Bhagyashree D Patil on 16 February 2019.
Keywords— Sandhi splitting, lexical analysis, rules, sandhi viccheda algorithm, DAG
I. INTRODUCTION

The Sanskrit language and its grammar have had a strong impact on computer science and related research areas. Many researchers have recognized the importance of Sanskrit in NLP because of its computational grammar. Well-known Vedas and religious texts written in Sanskrit, such as the SamVed, YajurVed, RigVed, AtharvaVed, Ramayana, Mahabharata, and Bhagwatgeeta, contain complex words and sentences that a reader may not understand. Such text is formed by combining two or more words, which makes it difficult to understand. A system is therefore needed that splits these words, sentences, and paragraphs so that the user obtains the actual words they are made of.

The Bhagwatgeeta contains many difficult words and sentences that a user cannot understand easily. To understand them, the user must know which words a complex word or sentence is made up of. The main goal of this system is therefore to provide an environment in which users can obtain the root words of a difficult sentence or paragraph. The system uses the sandhi viccheda algorithm to split such words and sentences, with a DAG representation of the possible outputs.

The proposed solution focuses on the sandhi viccheda of difficult words, sentences, and paragraphs that are hard for the user to understand. The rules of sandhi viccheda are used in the algorithm so that any word in the input is split into its constituent words. The input is traversed from left to right until a valid pada is found, and the split is made at that point. The system is completed by adding all the rules of sandhi viccheda (vowel, consonant, and visarga).

The paper is organized as follows: Section I introduces Sanskrit and sandhi viccheda, Section II covers related work on sandhi viccheda, Section III presents the proposed approach, Section IV gives the results and discussion of the proposed approach, and Section V concludes the work with future directions.

II. RELATED WORK

Rupali Deshmukh et al., in [1], describe natural language processing (NLP) as a field of artificial intelligence and linguistics concerned with the interactions between computers and human languages. The combination of two adjacent sounds, that is, their union, is called sandhi formation. Sandhi splitting is the process by which one letter is split to form two words, and it is one subtask in a complete analysis of natural-language input text. Their system recognizes a sandhi word in Sanskrit input text, splits it, and returns the type of sandhi. Namrata Tapaswi et al., in [2], present a set of instructions for formulating LFG (Lexical Functional Grammar) rules to parse Sanskrit. Some simple sentences are parsed to verify that the rules follow the grammatical construction of the language.
Priyanka Gupta et al., in [3], explain that sandhi means the combining of sounds. They propose a rule-based algorithm that gives an accuracy of 60-80%, depending on the number of rules implemented. Joshi Shripad S., in [4], presents a rule-based algorithm and rules for sandhi splitting of Marathi words. Latha R. Nair and S. David Peter, in [5], note that morphological analyzers are essential for any type of natural language processing work. Because Malayalam, like other Dravidian languages, is agglutinative, it needs a compound word splitter as a preprocessor. An algorithm was developed and successfully used for splitting compound words, with 90% success in an initial scrutiny of around 4000 compound words. The splitter is also used for developing and implementing a fully fledged morphological analyzer. Devadath V. V. et al., in [6], state that accurate sandhi splitting is crucial for text processing tasks such as POS tagging, topic modeling, and document indexing. They tried different approaches to the challenges of sandhi splitting in Malayalam and finally exploited the phonological changes that take place in words while joining. This resulted in a hybrid method that statistically identifies the split points and splits using predefined character-level linguistic rules. The system is reported to give an accuracy of 91.1%. M. Rajani Shree et al., in [7], show a novel approach to internal sandhi splitting for the Kannada language. Using a CRF (Conditional Random Fields) tool, about 1000 are tagged and 400 raw split words are given to the CRF tool. Training and testing data are produced from the tagged list. The accuracy using the CRF tool on the Kannada corpus is nearly 98.08, 92.91 and 95.43.

Sachin Kumar, in [8], presents a sandhi analyzer and splitter for the Sanskrit language. A rule-based method and lexical lookup are used in the system. In preprocessing, a lexical search is done on the examples before the sandhi analysis process. Subanta analysis then looks up the lexicon for avyaya and verb words to remove them from processing. Punctuation is also removed in preprocessing. Split words generated by reverse sandhi analysis are validated by subanta analysis. The system looks up a fixed word list of nouns, names, and MWSDD; entries found during processing remain unchanged. Shubham Bhardwaj et al., in [9], propose SandhiKosh, which evaluates the accuracy and completeness of Sanskrit sandhi tools. SandhiKosh uses five different methods to provide a corpus that is complete and able to measure the performance of sandhi tools on Sanskrit literature. Amba Kulkarni and Sheetal Pokar et al., in [10], developed a parser that finds a directed tree, given a graph of nodes representing words and edges representing possible relations between them. The Mimamsa constraint of akanksa is used to rule out non-solutions, and sannidhi is used to prioritize the solutions.

Vaishali Gupta et al., in [11], note that Urdu is a combination of several languages such as Arabic, Hindi, English, Turkish, and Sanskrit. A stemmer is used to convert a word to its root form: the suffix and prefix are removed from the word to extract the actual word. The accuracy of the system is up to 85%, but it has the drawbacks of over-stemming and under-stemming. Anil Kumar et al., in [12], report their experience in building an automatic paraphrase generator for Sanskrit. The paraphrase handler produces correct paraphrases for 90% of the compounds, and this requires a morphological analyzer. To know the meaning of a compound, one should know the relations between its components. Pawan Goyal et al., in [13], propose an approach called ‘utsarga apavaada’ for relation analysis, together with morphological analysis using deterministic finite automata (DFA). Given Sanskrit text, the parser identifies the root words and gives dependency relations based on semantic constraints. The parser creates semantic nets for Sanskrit paragraphs. Both external and internal sandhi are implemented in the parser for Sanskrit words. Bhagyashree Patil et al., in [14], present a comparative study of different systems for sandhi viccheda of Sanskrit words.

III. PROPOSED APPROACH

In the proposed system, Sanskrit words, sentences, and paragraphs are taken from the Bhagwatgeeta. The system applies the rules of sandhi viccheda to the input to split it into its constituent words. The DAG representation is used to find the possible outputs. The system works on all the sandhi viccheda rules. Some examples of Bhagwatgeeta words follow.

A. Examples of Bhagwatgeeta

भवान्भीष्मश्च [ भवान् भीष्मस् च ]: here the visarga rule has been applied; the visarga followed by च् becomes श्.

प्रथमोऽध्यायः [ प्रथमस् अध्यायस् ]: here the vowel sandhi rule has been applied; ओ followed by अ remains unchanged and an avagraha is added.

भगवद्गीता [ भगवत् गीता ]: here the consonant sandhi rule has been applied; the hard consonant त् becomes द् before the soft consonant ग्.

Sandhi means joining two or more words to get the meaning of the sentence, whereas sandhi viccheda (vigraha) means separating a joined form into its two or more constituent words. Examples of the sandhi viccheda rules are given in subsection B.
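The left-to-right traversal and the DAG of candidate splits described in this section can be illustrated with a minimal Python 3 sketch. It is not the authors' implementation: the names LEXICON, REVERSE_RULES, candidates, build_dag and all_splits are hypothetical, the lexicon is a toy word list, and the text is romanized without diacritics (with the visarga written as h) to keep the indexing simple. Each DAG edge records that the surface span starting at one position and ending at another can be read as a valid pada, either directly or after undoing one sandhi rule at its right edge.

```python
from typing import Dict, List, Tuple

# Toy lexicon of valid padas (assumption: romanized, no diacritics).
LEXICON = {"bhagavat", "gita", "namah", "te"}

# Hypothetical reverse-sandhi rules: (surface ending, original ending,
# characters that must follow the ending for the rule to apply).
REVERSE_RULES = [
    ("d", "t", set("gjdb")),  # hard-to-soft consonant: t became d before a soft consonant
    ("s", "h", set("t")),     # visarga (written h here) became s before t or th
]

Edge = Tuple[int, str]  # (end position, recovered pada)


def candidates(chunk: str, next_char: str) -> List[str]:
    """Return the chunk itself plus every form obtained by undoing one rule."""
    forms = [chunk]
    for surface, original, followers in REVERSE_RULES:
        if chunk.endswith(surface) and next_char in followers:
            forms.append(chunk[: -len(surface)] + original)
    return forms


def build_dag(text: str) -> Dict[int, List[Edge]]:
    """Scan left to right and add an edge wherever a valid pada is found."""
    dag: Dict[int, List[Edge]] = {}
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            next_char = text[end] if end < len(text) else ""
            for form in candidates(text[start:end], next_char):
                if form in LEXICON:
                    dag.setdefault(start, []).append((end, form))
    return dag


def all_splits(text: str) -> List[List[str]]:
    """Enumerate every path through the DAG from position 0 to the end."""
    dag = build_dag(text)

    def walk(pos: int) -> List[List[str]]:
        if pos == len(text):
            return [[]]
        return [[pada] + rest for end, pada in dag.get(pos, []) for rest in walk(end)]

    return walk(0)


if __name__ == "__main__":
    print(all_splits("bhagavadgita"))  # [['bhagavat', 'gita']]
    print(all_splits("namaste"))       # [['namah', 'te']]
```

Every path from position 0 to the end of the input is one possible output. With a realistic lexicon and rule set, an ambiguous input would yield several paths, which is the reason for keeping the candidates in a DAG rather than committing to the first split found.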
B. Rules
Vowel Sandhi Rule: Consider the vowel sandhi rule that (आ) is formed when (अ, आ)
visarga changes to स् when followed by त् or थ्, and the output is split into नमः and ते.
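The visarga rule above, read in the splitting direction, can be sketched as follows. This is a minimal Python 3 sketch under stated assumptions and not the authors' code: the function name reverse_visarga_sa and the constants are hypothetical, the input is assumed to be plain Unicode Devanagari, and the candidates it returns would still need to be checked against a lexicon of valid padas.

```python
from typing import List, Tuple

VISARGA = "\u0903"               # ः
SA_HALANT = "\u0938\u094d"       # स् (sa followed by the virama sign)
TRIGGERS = ("\u0924", "\u0925")  # त, थ


def reverse_visarga_sa(word: str) -> List[Tuple[str, str]]:
    """Return candidate (left, right) splits in which स् before त or थ
    is restored to a visarga at the end of the left part."""
    splits = []
    start = word.find(SA_HALANT)
    while start != -1:
        nxt = start + len(SA_HALANT)
        # The rule applies only if something precedes स् and त or थ follows it.
        if start > 0 and nxt < len(word) and word[nxt] in TRIGGERS:
            splits.append((word[:start] + VISARGA, word[nxt:]))
        start = word.find(SA_HALANT, start + 1)
    return splits


if __name__ == "__main__":
    # नमस्ते is expected to give one candidate: नमः and ते
    for left, right in reverse_visarga_sa("\u0928\u092e\u0938\u094d\u0924\u0947"):
        print(left, "+", right)
```

The function only proposes split points; keeping rule application separate from validation means the same pattern could be repeated for the vowel and consonant rules and the results filtered in one place.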
IV. RESULT AND DISCUSSION

The system is implemented in Python, version 2.7, and the dataset used is the Bhagwatgeeta. The output of the system is in the Sanskrit language. Table 1 shows the analysis of sentences in the Bhagwatgeeta. Approximately 05 sentences have been analyzed in the system.

Table 1. Analysis of Bhagwatgeeta Sentences

No.   Sentences                               Split
      ... प्रजहि ह्येनं ज्ञानविज्ञाननाशनम्      ... + नि + यम्य + भरत + ऋषभ + पा + अप् + मानम् + प्र + जहि + एनम् + ज्ञान + विज्ञान + नाशनम्
5.    इमं विवस्वते योगं                         इमम् + विवस्वते + ...
08    तान्ब्रवीमि                                तान् + ब्रवीमि
09    भीष्माभिरक्षितम्                           भीष्म + अभि + रक्षितम्
10    भीष्ममेवाभिरक्षन्तु                        भीष्मम् + एव + अभि + रक्षन्तु

The system analyzes the Bhagwatgeeta data (words, sentences, and paragraphs), which shows that the rules of the Sanskrit language are implemented. In line with the stated objective, the results show that the rules are implemented in the system and that the analysis is done using Bhagwatgeeta data. The analysis of the rules of sandhi viccheda in Sanskrit is shown in Table 3. Each rule of sandhi viccheda has been analyzed. A limitation of the system is that some rules are not implemented because the system does not support some symbols of the Sanskrit language.

Table 3. Analysis of Rules of Sandhi Viccheda

Sr no.  Sandhi                          Type        No. of rules passed  No. of rules failed
1.      Dirgha Sandhi Rule              Vowel       04                   00
2.      Guna Sandhi Rule                Vowel       04                   00
3.      Yana Sandhi Rule                Vowel       03                   01
4.      Vrddhi Sandhi Rule              Vowel       04                   00
5.      Ayadi Sandhi Rule               Vowel       06                   00
6.      Soft to hard consonant          Consonant   00                   01
7.      Hard to soft consonant          Consonant   01                   00
8.      ह् to 4th of class               Consonant   01                   00
9.      Dental to palatal               Consonant   04                   00
10.     Dental to cerebral              Consonant   03                   01
11.     Dental to ल्                     Consonant   01                   01
12.     Consonant to nasal of class     Consonant   02                   00
13.     श् to छ्                          Consonant   01                   00
14.     म् and न् to anusvara            Consonant   05                   01
15.     Consonants ङ्ख, ण्, न्            Consonant   01                   00
16.     च् before छ् and न् to ण्         Consonant   02                   01
17.     Visarga to ‘ओ’                  Visarga     02                   00
18.     Visarga is dropped              Visarga     05                   01
19.     Visarga to र्                    Visarga     01                   02
20.     Visarga to श्, ष्, स्             Visarga     05                   01
21.     Visarga to ardha-visarga        Visarga     02                   00
Total                                               57                   10

V. CONCLUSION AND FUTURE WORK

Sandhi viccheda is a technique in which long words, sentences, and paragraphs are separated into their constituent words. It is a challenging task to implement for the Sanskrit language. So far vowel sandhi had been implemented; adding the other rules increases the accuracy of the system. The difficult part is to provide the output in the Sanskrit language itself. The result is obtained when a Sanskrit word is given to the system as input. Although much work has been done in this area, the dataset used here, the Bhagwatgeeta, is different. In future, text summarization can be done in the Sanskrit language using different methods so that the user can understand the summary of a paragraph.

REFERENCES

[1] Rupali Deshmukh and Varunakshi Bhojane, “Building Vowel Sandhi Viccheda System for Sanskrit”, International Journal of Innovations and Advancement in Computer Science, Vol. 4, December 2015.
[2] Mrs. Namrata Tapaswi, Dr. Suresh Jain and Mrs. Vaishali Chourey, “Parsing Sanskrit Sentences Using Lexical Functional Grammar”, Proceedings of the International Conference on Systems and Informatics, IEEE, pp 2636-2640, 2012.
[3] Priyanka Gupta, Vishal Goyal, “Implementation of Rule-Based Algorithm for Sandhi-Viccheda of Compound Hindi Words”, IJCSI International Journal of Computer Science Issues, Vol. 3, 2009.
[4] Joshi Shripad S., “Sandhi Splitting of Marathi Compound Words”, International Journal on Advanced Computer Theory and Engineering (IJACTE), Vol. 2, no. 02, 2012.
[5] Latha R. Nair, S. David Peter, “Development of a Rule-Based Learning System for Splitting Compound Words in Malayalam Language”, Proceedings of the Recent Advances in Intelligent Computational Systems, IEEE, pp 751-755, 2011.
Authors Profile
Miss Bhagyashree D. Patil is pursuing a Master's degree in Computer Science and Engineering at Shrama Sadhana Bombay Trust, College of Engineering, Bambhori, Jalgaon (2018).