Article  in  INTERNATIONAL JOURNAL OF COMPUTER SCIENCES AND ENGINEERING · September 2018


DOI: 10.26438/ijcse/v6i9.355360

International Journal of Computer Sciences and Engineering, Open Access
Research Paper, Vol. 6, Issue 9, Sept. 2018, E-ISSN: 2347-2693

Implementation of Sandhi Viccheda for Sanskrit Words/Sentences/Paragraphs

Bhagyashree Patil 1*, Manoj Patil 2
1, 2 Department of Computer Engineering, SSBT’s COET, Jalgaon, India
* Corresponding Author: [email protected], Tel.: 0257-2261096

Available online at: www.ijcseonline.org

Accepted: 19/Sept/2018, Published: 30/Sept/2018


Abstract—Sandhi is the joining of two or more words into one. Sandhi viccheda is the reverse process: splitting a word or sentence into its constituents. Sandhi viccheda has previously been implemented for languages such as Hindi, Urdu, Marathi, Kannada, and Malayalam. In Sanskrit, words are formed according to rules, and these rules play an important role in the sandhi viccheda process, so the proposed system uses them to split words and sentences. The BHAGWAT GEETA is a religious book containing complex words and sentences that readers find hard to understand. The proposed system takes such words and sentences from the BHAGWAT GEETA as input for splitting, applying the vowel, consonant, and visarga rules of sandhi viccheda. The possible splits are represented as a directed acyclic graph (DAG), from which a number of candidate outputs are obtained.

Keywords— Sandhi splitting, lexical analysis, rules, sandhi viccheda algorithm, DAG

I. INTRODUCTION

The Sanskrit language and its grammar have had a marked impact on computer science and related research areas. Many researchers have recognized the importance of Sanskrit in NLP because of its computational grammar. Famous Vedas and religious books are written in Sanskrit, such as the SamVed, YajurVed, RigVed, AtharvaVed, Ramayana, Mahabharata, and Bhagwatgeeta, and they contain complex words and sentences that a reader cannot easily understand. This text is hard to read because each complex form is a combination of two or more words. A system is therefore needed that splits such words, sentences, and paragraphs so that the user obtains the actual words they are made of.

The Bhagwatgeeta contains many difficult words and sentences. To understand them, the user must know which words a complex word or sentence is made up of. The main goal of this system is to provide an environment in which the user obtains the root words of a difficult sentence or paragraph. The system uses the sandhi viccheda algorithm to split those words and sentences, with a DAG representation of the possible outputs.

The proposed solution focuses on the sandhi viccheda of words, sentences, and paragraphs that are difficult for the user to understand. The rules of sandhi viccheda are used in the algorithm so that any word in the input is split into its constituent words. The input is traversed from left to right until a valid pada is found, and a split is made at that point. The system is completed by adding all the rules of sandhi viccheda (vowel, consonant, and visarga).

The paper is organized as follows: Section I introduces Sanskrit and sandhi viccheda, Section II covers related work on sandhi viccheda, Section III describes the proposed approach, Section IV presents the results and discussion, and Section V concludes the work with future directions.

II. RELATED WORK

Rupali Deshmukh et al. [1] note that natural language processing (NLP) is a field of artificial intelligence and linguistics concerned with the interactions between computers and human languages. The combination of two immediate sounds, that is, their union, is called sandhi formation. Sandhi splitting describes the process by which one letter is split to form two words, and it is one subtask in the complete analysis of input text in NL. Their system recognizes a sandhi word in Sanskrit input text, splits it, and returns the type of sandhi. Namrata Tapaswi et al. [2] present a set of instructions for formulating LFG rules to parse Sanskrit. Some simple sentences are parsed, which verifies that the rules follow the grammatical construction of the language.

© 2018, IJCSE All Rights Reserved 355



Priyanka Gupta et al. [3] explain that sandhi means combining (of sounds). They propose a rule-based algorithm that gives an accuracy of 60-80%, depending on the number of rules implemented. Joshi Shripad S. [4] presents a rule-based algorithm and rules for sandhi splitting of Marathi words. Latha R. Nair and S. David Peter [5] describe morphological analyzers as essential for any type of natural language processing work. Because Malayalam, like other Dravidian languages, is agglutinative, it needs a compound word splitter as a preprocessor. They developed an algorithm and used it successfully to split compound words, with a 90% success rate in an initial scrutiny of around 4000 compound words. The splitter is also used for developing and implementing a fully fledged morphological analyzer. Devadath V. V. et al. [6] argue that accurate sandhi splitting is crucial for text processing tasks such as POS tagging, topic modeling, and document indexing. They tried different approaches to the challenges of sandhi splitting in Malayalam and finally exploited the phonological changes that take place in words when they are joined. The result is a hybrid method that identifies split points statistically and then splits using predefined character-level linguistic rules. The system currently gives an accuracy of 91.1%. M. Rajani Shree et al. [7] present a novel approach to internal sandhi splitting for the Kannada language. Using a CRF (Conditional Random Fields) tool, about 1000 are tagged and 400 raw split words are given to the CRF tool. Training and testing data are produced from the tagged list. The accuracy using the CRF tool on the Kannada corpus is reported as approximately 98.08, 92.91 and 95.43. Sachin Kumar [8] presents a sandhi analyzer and splitter for Sanskrit that uses a rule-based method and lexical lookup. During preprocessing, punctuation is removed and a lexical search is run on the examples before the sandhi analysis process. Subanta analysis then takes place, which looks up avyaya and verb words in the lexicon to remove them from processing. Split words generated by reverse sandhi analysis are validated by subanta analysis. The system also looks up a fixed word list of nouns, names, and MWSDD; entries found during processing remain unchanged. Shubham Bhardwaj et al. [9] propose SandhiKosh, which evaluates the accuracy and completeness of Sanskrit sandhi tools. SandhiKosh uses five different methods to provide a corpus that is complete and able to measure the performance of sandhi tools on Sanskrit literature. Amba Kulkarni, Sheetal Pokar, et al. [10] developed a parser that, given a graph whose nodes represent words and whose edges represent possible relations between them, finds a directed tree. The Mimamsa constraint of akanksa is used to rule out non-solutions, and sannidhi is used to prioritize the solutions. Vaishali Gupta et al. [11] note that Urdu is a combination of several languages such as Arabic, Hindi, English, Turkish, and Sanskrit. They use a stemmer to convert a word to its root form: suffixes and prefixes are removed to extract the actual word. The accuracy of the system is up to 85%, but it suffers from over-stemming and under-stemming. Anil Kumar et al. [12] report their experience building an automatic paraphrase generator for Sanskrit. The paraphrase handler uses 90% of the compounds to produce correct paraphrases, which requires a morphological analyzer. To know the meaning of a compound, one must know the relations between its components. Pawan Goyal et al. [13] propose an approach called ‘utsarga apavaada’ for relation analysis, together with morphological analysis using deterministic finite automata (DFA). Given Sanskrit text, the parser identifies root words and gives dependency relations based on semantic constraints. The proposed parser creates semantic nets for Sanskrit paragraphs. Both external and internal sandhi are implemented in the parser for Sanskrit words. Bhagyashree Patil et al. [14] present a comparative study of different systems for sandhi viccheda of Sanskrit words.

III. PROPOSED APPROACH

In the proposed system, Sanskrit words, sentences, and paragraphs are taken from the Bhagwatgeeta. The system applies the rules of sandhi viccheda to the input to split it into its constituent words. The DAG representation is used to find the possible outputs. The system works with all the sandhi viccheda rules. Some example words from the Bhagwatgeeta follow.

A. Examples of Bhagwatgeeta

• भवान्भीष्मश्च [ भवान् भीष्मस् च ]: here the visarga rule has been applied; the visarga followed by च् becomes श्.
• प्रथमोऽध्यायः [ प्रथमस् अध्यायस् ]: here the vowel sandhi rule has been applied; ओ followed by अ remains unchanged and an avagraha is added.
• भगवद्गीता [ भगवत् गीता ]: here the consonant sandhi rule is applied; the hard consonant त्, when followed by the soft consonant ग्, becomes द्.

Sandhi means the joining of two or more words to give the meaning of the sentence, whereas sandhi viccheda (vigraha) means separating a joined form into its two or more constituent words. The following are examples of the sandhi viccheda rules:


B. Rules

• Vowel Sandhi Rule: Consider the vowel sandhi rule that (आ) is formed when (अ, आ) combines with (अ, आ). In the example, विद्यालयः is split into विद्या and आलयः. The word is split at the point द्याल, where the आ was formed when the final आ of विद्या combined with the आ of आलयः. As a further example, (अर्) is formed when (अ, आ) combines with (ऋ, ॠ). ग्रीष्मर्तुः is split at the point ष्मर्तु, where अर् was formed when अ combined with ऋ.

• Consonant Sandhi Rule: A soft class consonant, except a nasal, followed by a hard consonant changes to the 1st consonant of its class. Consider the example एतत्पतति, which has एतद् and पतति as its constituent words. Here the soft consonant द्, combined with the hard consonant प्, forms the word एतत्पतति. The next rule is that न् at the end of a word changes to anusvara and श् (palatal sibilant) when followed by (च्, छ्) (palatal hard consonants). Consider the example कांश्चित्, where the anusvara and श् are converted back to the original न्, followed by च्. That is, कांश्चित् is split into कान् and चित्.

• Visarga Sandhi Rule: Consider the visarga sandhi rule that the visarga after आ (standing for आस्) is dropped when followed by a vowel or soft consonant. For example, नरा गच्छन्ति is split into नराः and गच्छन्ति; here the visarga had been dropped. The other example is नमस्ते; here the visarga changes to स् when followed by त् or थ्, and the output is split into नमः and ते.

The proposed system performs sandhi viccheda of complex Sanskrit words, sentences, and paragraphs. Sandhi viccheda gives the user the root words of a complex word, which helps the user understand what it means. The system uses the sandhi viccheda algorithm and a DAG representation of the possible outcomes. Graphs are networks consisting of nodes connected by edges or arcs. In directed graphs, the connections between nodes have a direction and are called arcs; in undirected graphs, the connections have no direction and are called edges. Algorithms on graphs include finding a path between two nodes, finding the shortest path between two nodes, determining cycles in the graph (a cycle is a non-empty path from a node to itself), finding a path that reaches all nodes (the famous “traveling salesman problem”), and so on. Here the NetworkX package is used to obtain the DAG representation of the result; that is, the output is displayed as up to 10 possible paths, found using the shortest path.

C. Sandhi Viccheda Algorithm

• Procedure: Sanskrit lexical analyzer
• Input: Sanskrit word / sentence / paragraph
• Action: Traverse the word/sentence, splitting it (or not) at each location to determine all possible valid splits.
• Traverse from left to right.
• Using recursion (with memoization), assemble the results of all choices: to split or not to split at each phoneme.
• If split, consider all possible left/right combinations of phonemes that may result.
• Once split, check whether the left section is a valid pada (use the level 1 tool to pick the pada type and tag it morphologically).
• If the left section is valid, proceed to split the right section.
• At the end of step 2, the system will have all possible syntactically valid splits with morphological tags.
• Output: All semantically valid sandhi split sequences

In the algorithm, a Sanskrit word or sentence is given to the system as input. The word or sentence is then traversed from left to right until all the splits are found. When all the splits of the left section have been found, processing of the right section starts; after that, the user gets all the valid splits with morphological tags.

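
The DAG view of the candidate splits can be illustrated with a small standard-library sketch. The paper uses the NetworkX package; the version below deliberately swaps in a plain dictionary and a breadth-first search so that it has no dependencies, and it ignores sandhi sound changes (edges are plain substrings found in a hypothetical toy lexicon). Nodes are character offsets in the transliterated input, each edge is a valid pada, and every path from offset 0 to the end is one candidate split. Paths with fewer edges are returned first, capped at 10 by default. The function names `build_dag` and `k_shortest_paths` are assumptions made for this example.

```python
from collections import deque


def build_dag(text, lexicon):
    """Map each start offset to a list of (end offset, pada) edges."""
    n = len(text)
    dag = {i: [] for i in range(n + 1)}
    for i in range(n):
        for j in range(i + 1, n + 1):
            if text[i:j] in lexicon:  # an edge exists only for a valid pada
                dag[i].append((j, text[i:j]))
    return dag


def k_shortest_paths(dag, source, target, k=10):
    """Return up to k source-to-target paths, fewest edges first."""
    paths = []
    queue = deque([(source, ())])  # (current node, padas collected so far)
    while queue and len(paths) < k:
        node, padas = queue.popleft()
        if node == target:
            paths.append(padas)
            continue
        # Edges only point forward, so the graph is acyclic and the
        # breadth-first search always terminates.
        for nxt, pada in dag[node]:
            queue.append((nxt, padas + (pada,)))
    return paths


# Hypothetical toy lexicon, used only to show several competing paths.
toy_lexicon = {"nara", "pati", "na", "ra", "narapati"}
text = "narapati"
dag = build_dag(text, toy_lexicon)
for path in k_shortest_paths(dag, 0, len(text)):
    print(" + ".join(path))
```

Because the search is breadth-first, the unsplit reading comes out before the two-pada and three-pada readings, which is one simple way to rank candidate splits by path length.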

IV. RESULTS AND DISCUSSION

The system is implemented in Python, version 2.7, and the dataset used is the Bhagwatgeeta. The results are produced in the Sanskrit language itself. Table 1 shows the analysis of sentences from the Bhagwatgeeta. About 5 sentences have been analyzed in the system.

Table 1. Analysis of Bhagwatgeeta Sentences

Sr no. | Input sentence | Output
1. | ज्यायसी चेत्कर्मणस्ते मता बुद्धिर्जनार्दन तत्किं कर्मणि घोरे मां नियोजयसि केशव | ज्यायसी + चेत् + कर्मणस् + ते + अमता + बुद्धिस् + जन + आर्दन् + अतत् + किम् + कर्मणि + घोरे + माम् + नि + योजयसि + केशव
2. | लोकेऽस्मिन् द्विविधा निष्ठा पुरा प्रोक्ता मयानघ ज्ञानयोगेन साङ्ख्यानां कर्मयोगेन योगिनाम् | लोके + अस्मिन् + द्विविधा + निष्ठा + पुरा + प्र + उक्ता + मयान् + अघ + ज्ञान + योगेन + साङ्ख्यानाम् + कर्म + योगेन + योगिनाम्
3. | ॐ तत्सदिति श्रीमद्भगवद्गीतासूपनिषत्सु ब्रह्मविद्यायां योगशास्त्रे श्रीकृष्णार्जुनसंवादे साङ्ख्ययोगो नाम द्वितीयोऽध्यायः | ओम् + तत् + सत् + इति + श्रीमत् + भगवत् + गीतासु + उपनिषत्सु + ब्रह्म + विद्यायाम् + योग + शास्त्रे + श्री + कृष्णा + अर्जुन + संवादे + साङ्ख्य + योगस् + अनाम + द्वितीयस् + अध्यायस्
4. | तस्मात्त्वमिन्द्रियाण्यादौ नियम्य भरतर्षभ पाप्मानं प्रजहि ह्येनं ज्ञानविज्ञाननाशनम् | तस्मात् + त्वम् + इन्द्रियाणि + आदौ + नि + यम्य + भरत + ऋषभ + पा + अप् + मानम् + प्र + जहिहि + एनम् + ज्ञान + विज्ञान + नाशनम्
5. | इमं विवस्वते योगं प्रोक्तवानहमव्ययम् विवस्वान्मनवे प्राह मनुरिक्ष्वाकवेऽब्रवीत् | इमम् + विवस्वते + योगम् + प्र + उक्तवान् + अहम् + अव्ययम् + विवस्वान् + मनवे + प्र + आह + मनुस् + इक्ष्वाकवे + अब्रवीत्

From the Bhagwatgeeta, 500 words have been analyzed; 10 of them are shown in Table 2.

Table 2. Analysis of Bhagwatgeeta Words

Sr no. | Input word | Output
01 | प्रथमोऽध्यायः | प्रथमस् + अध्यायस्
02 | पाण्डवाश्चैव | पाण्डवास् + च + एव
03 | दुर्योधनस्तदा | दुर्योधनस् + तदा
04 | आचार्यमुपसंगम्य | आचार्यम् + उप + सम् + गम्य
05 | पाण्डुपुत्राणामाचार्य | पाण्डु + पुत्राणाम् + आचार्य
06 | भीमार्जुनसमा | भीम + अर्जुन + समा
07 | पुरुजित्कुन्तिभोजश्च | पुरु + जित् + कुन्ति + भोजस् + च
08 | तान्ब्रवीमि | तान् + ब्रवीमि
09 | भीष्माभिरक्षितम् | भीष्म + अभि + रक्षितम्
10 | भीष्ममेवाभिरक्षन्तु | भीष्मम् + एव + अभि + रक्षन्तु

The system analyzes the Bhagwatgeeta data (words, sentences, and paragraphs), which shows that the rules of the Sanskrit language are implemented. In line with the stated objective, the results show that the rules are implemented in the system and that the analysis is done using Bhagwatgeeta data. The analysis of the rules of sandhi viccheda in Sanskrit is shown in Table 3. Each and every rule of sandhi viccheda has been analyzed. A limitation of the system is that some rules are not implemented because the system does not support some symbols of the Sanskrit language.

Table 3. Analysis of Rules of Sandhi Viccheda

Sr no. | Sandhi | Type | No. of rules passed | No. of rules failed
1. | Dirgha Sandhi Rule | Vowel | 04 | 00
2. | Guna Sandhi Rule | Vowel | 04 | 00
3. | Yana Sandhi Rule | Vowel | 03 | 01
4. | Vrddhi Sandhi Rule | Vowel | 04 | 00
5. | Ayadi Sandhi Rule | Vowel | 06 | 00
6. | Soft to hard consonant | Consonant | 00 | 01
7. | Hard to soft consonant | Consonant | 01 | 00
8. | ह् to 4th of class | Consonant | 01 | 00
9. | Dental to palatal | Consonant | 04 | 00
10. | Dental to cerebral | Consonant | 03 | 01
11. | Dental to ल् | Consonant | 01 | 01
12. | Consonant to nasal of class | Consonant | 02 | 00
13. | श् to छ् | Consonant | 01 | 00
14. | म् and न् to anusvara | Consonant | 05 | 01
15. | Consonants ङ्, ण्, न् | Consonant | 01 | 00
16. | च् before छ् and न् to ण् | Consonant | 02 | 01
17. | Visarga to ‘ओ’ | Visarga | 02 | 00
18. | Visarga is dropped | Visarga | 05 | 01
19. | Visarga to र् | Visarga | 01 | 02
20. | Visarga to श्, ष्, स् | Visarga | 05 | 01
21. | Visarga to ardha-visarga | Visarga | 02 | 00
Total | | | 57 | 10

V. CONCLUSION AND FUTURE WORK

Sandhi viccheda is a technique in which long words, sentences, and paragraphs are separated into their constituent words. It is a challenging task to implement for the Sanskrit language. Until now only the vowel sandhi had been implemented; adding the other rules increases the accuracy of the system. The difficult part is to provide the output in the Sanskrit language itself. The result is obtained when a Sanskrit word is given to the system as input. Although much work has been done in this area, the dataset used here, the Bhagwatgeeta, is different. In future, text summarization can be carried out in the Sanskrit language using different methods so that the user can understand the summary of a paragraph.

REFERENCES

[1] Rupali Deshmukh and Varunakshi Bhojane, “Building Vowel Sandhi Viccheda System for Sanskrit”, International Journal of Innovations and Advancement in Computer Science, Vol. 4, December 2015.
[2] Mrs. Namrata Tapaswi, Dr. Suresh Jain and Mrs. Vaishali Chourey, “Parsing Sanskrit Sentences Using Lexical Functional Grammar”, Proceedings of the International Conference on Systems and Informatics, IEEE, pp 2636-2640, 2012.
[3] Priyanka Gupta, Vishal Goyal, “Implementation of Rule-Based Algorithm for Sandhi-Viccheda of Compound Hindi Words”, IJCSI International Journal of Computer Science Issues, Vol. 3, 2009.
[4] Joshi Shripad S., “Sandhi Splitting of Marathi Compound Words”, International Journal on Advanced Computer Theory and Engineering (IJACTE), Vol. 2, no. 02, 2012.
[5] Latha R. Nair, S. David Peter, “Development of a Rule-Based Learning System for Splitting Compound Words in Malayalam Language”, Proceedings of the Recent Advances in Intelligent Computational Systems, IEEE, pp 751-755, 2011.


[6] Devadath V V, Litton J Kurisinkel, Dipti Misra Sharma, and Vasudeva Varma, “A Sandhi Splitter for Malayalam”, Proceedings of the 11th International Conference on Natural Language Processing, 2014.
[7] M. Rajani Shree, Sowmya Lakshmi, “A novel approach to Sandhi splitting at Character level for Kannada Language”, Proceedings of the 2016 International Conference on Computational Systems and Information Systems for Sustainable Solutions, IEEE, pp 17-20, 2016.
[8] Sachin Kumar, “Sandhi Splitter and Analyzer for Sanskrit”, With Special Reference to aC Sandhi, Special Centre for Sanskrit Studies, Jawaharlal Nehru University, New Delhi, 2007.
[9] Shubham Bhardwaj, Neelamadhav Gantayat, Nikhil Chaturvedi, Rahul Garg, Sumeet Agarwal, “SandhiKosh: A Benchmark Corpus for Evaluating Sanskrit Sandhi Tools”, Language Resources and Evaluation Conference, 2018.
[10] Amba Kulkarni, Sheetal Pokar and Devanand Shukl, “Designing a Constraint-Based Parser for Sanskrit”, Proceedings of the International Sanskrit Computational Linguistics Symposium, SpringerLink, vol. 6465, pp 70-90, 2010.
[11] Vaishali Gupta, Nisheeth Joshi, Iti Mathur, “Rule-Based Stemmer in Urdu”, Proceedings of the 2013 4th International Conference on Computer and Communication Technology, IEEE, pp 129-132, 2013.
[12] Anil Kumar, V.Sheebasudheer, Amba Kulkarni, “Sanskrit Compound Paraphrase Generator”, Proceedings of the ICON, 2009.
[13] Pawan Goyal, Vipul Arora, and Laxmidhar Behera, “Analysis of Sanskrit Text: parsing and Semantic Nets”, Springerlink, Proceedings of the Sanskrit Computational Linguistics, Vol. 5402, pp 200-218, 2009.
[14] Bhagyashree D. Patil and Manoj E. Patil, “A Review on Implementation of Sandhi Viccheda for Sanskrit Words”, Proceedings of the International Conference in ICGTETM, IJCRT, vol.5, no. 12, pp 489-493, December 2017.

Authors Profile
Miss Bhagyashree D. Patil is pursuing a Master's degree in Computer Science and Engineering at Shrama Sadhana Bombay Trust, College of Engineering Bambhori, Jalgaon, in the year 2018.

Mr. Manoj E. Patil pursued a Ph.D. and is currently working as an Assistant Professor in the Department of Computer Science and Engineering, North Maharashtra University, Jalgaon. He is a member of ISOC and IAENG. He has published more than 15 research papers in reputed international journals and has 13 years of teaching experience.
