Analysis and Evaluation of Stemming Algorithms: A Case Study with Assamese

Navanath Saharia∗, Department of CSE, Tezpur University, Napaam, India-784028
Utpal Sharma†, Department of CSE, Tezpur University, Napaam, India-784028
Jugal Kalita‡, Department of CS, University of Colorado, Colorado Springs, USA-80918
[email protected] [email protected] [email protected]

∗Navanath Saharia is currently pursuing his Ph.D. in Computer Science and Engineering, Tezpur University, India.
†Utpal Sharma is working as an Associate Professor in the Department of Computer Science and Engineering, Tezpur University, India.
‡Jugal Kalita is a Professor in the Department of Computer Science, Colorado University at Colorado Springs, USA.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ICACCI'12, August 3-5, 2012, Chennai, T Nadu, India.
Copyright 2012 ACM 978-1-4503-1196-0/12/08…$10.00.

ABSTRACT
Stemming is the process of automatically extracting the base form of a given word of a language. Assamese is a morphologically rich Indo-Aryan language with relatively free word order, spoken in the north-eastern part of India and written in the Assamese-Bengali script. As it is among the less computationally studied languages, our aim is to extract the stem from a given word. We adopt the suffix stripping approach along with a rule engine that generates all the possible suffix sequences. We obtained 82% accuracy with the suffix stripping approach after adding a root-word list of approximately 20,000 entries.

Keywords
Stemming, Inflectional language, Assamese

1. INTRODUCTION
In any information retrieval (IR) system, one of the first tasks is to extract and index the information available in the document in the form of words or terms. As most Indian languages are highly inflectional (and, as a result, morphologically rich), they are adversely affected by the abundance of words appearing in various morphological forms. Indexing all the words is tedious as well as uninformative for further processing. To reduce the morphological variation in indexing, the most common method is to represent the words in a normalized representative form. One such approach is the process of finding the stem or root. Though root and stem have different meanings linguistically, in this report we use both words interchangeably to mean the stem of an inflected form. Stemming is one of the initial steps in analyzing the morphology of words. A number of approaches have been reported by researchers for different language groups. The reported approaches can be classified into two broad categories: rule-based techniques [10, 14] and unsupervised techniques [7, 16]. The rule-based approach produces the best results for languages with relatively fixed word order or with little inflection. But for a highly inflectional language, we need more complete linguistic knowledge of the formation of words to handle more complex derivational and inflectional morphology. To extract the root from an inflected word, we should have detailed knowledge of word formation in that language. For highly inflectional languages some probable ways of forming words are listed below.

1. Combining more than one root word.
2. Partially combining more than one root word.
3. Adding tense, aspect or mood markers.
4. Word-to-word translation from another language.
5. Partially combining more than one root word from two languages.

After passing through any of the steps mentioned above, a root word changes in terms of morphology, syntax or semantics. An unsupervised approach needs linguistic resources, including a corpus. Thus an unsupervised approach may not produce better results for resource-poor languages such as most Indian languages, including Assamese. There are also a number of published works that combine the two approaches mentioned above; we refer to these as hybrid.

Assamese is one of the major languages in the north-eastern parts of India, with approximately 30 million native speakers. In this study, we consider the problem of stemming Assamese texts, in which stemming is particularly hard due to the common appearance of single-letter suffixes as morphological inflections. In our study we observed that more than 50% of the inflections in the Assamese language appear as single-letter suffixes. Such single-letter morphological inflections cause ambiguity while predicting the underlying root word.

The rest of the paper is organized as follows. In Section 2 we describe previous work related to stemming, followed
by the linguistic characteristics of Assamese in Section 3. Section 4 describes our approach and Section 5 gives the results and analysis of the approach. Finally, Section 6 concludes the paper.

2. RELATED WORK
Among the reported work, the Porter stemmer [10], an iterative rule-based approach, is the most successful and is widely used in applications such as spell-checking and morphological analysis. Lovins' approach [5] and minimum description length [3] are among the other popular stemming algorithms.

In the Indian language context a small number of hand-crafted rule-based stemmers have been reported, whose primary idea is to strip off suffixes. Among these, [12] used a hand-crafted suffix list and stripped off the longest suffix for Hindi, reporting 88% accuracy using a dictionary of size 35,997. The work reported in [7] learned suffix stripping rules from a corpus and used clustering to discover the nearest class of the root word for Bengali, English and French. They described a centroid-based approach that rewards the longest common prefix to form clusters of similar words based on a threshold value. The work of [8] focused on some heuristic rules for Hindi and reported 89% accuracy. [1] proposed a hybrid form of [7] and [8] for Hindi and Gujarati, with precisions of 78% and 83%, respectively. Their approach took both prefixes and suffixes into account. They used a dictionary and suffix replacement rules and claimed that the approach is portable and fast.

The work of [4] for Punjabi used a dictionary of size 52,000 and found 81.27% accuracy using a brute-force approach. [6] described a hybrid method (rule based + suffix stripping + statistical) for Marathi and found 82.50% precision for their system. Work on Malayalam [11] used a dictionary of size 3,000 and reported a 90.5% accurate system using finite state machines. We have found only [16, 9] as examples of the state of the art in Assamese stemming, where [16] uses an unsupervised approach to acquire the language-specific stemming rules.

3. LANGUAGE RELATED ISSUES
Assamese is a morphologically rich Indic language spoken in the north-eastern part of India. Among the reported work on Assamese, [15, 16, 13, 14] are the main ones. As in most Indic languages, in Assamese only prefixes and suffixes are attached, at the start and end of a word respectively. Though this seems simple, there are certain characteristics that complicate the Assamese stemming process.

1. A sub-string that matches a suffix is not always a suffix in Assamese word construction. For example, <-bor> and <-mAn> denote the indefinite plural marker, but in ketbor and kisumAn we cannot extract <-bor> and <-mAn> as suffixes. Likewise, <-jan> is used to mark a singular, masculine, animate noun. But this is not the case with the word prayojan: we cannot extract <-jan> as a suffix denoting a singular, masculine, animate noun.

2. The corpus we process has a number of misspelled words, and sometimes suffixes are written separately from the root word. For example, we found the suffix <-samUh> occurring more than 500 times as a single word in a corpus of size 2.6 million, which is grammatically wrong. There are a number of other irregularities, specifically in the case of hyphenated words.

3. Sometimes a whole word can appear as a suffix, or, put differently, a suffix can appear as a separate word. For example, <-dal> is used as an indefinite plural marker on animate nouns; when <dal> appears as a separate word it means a kind of aquatic grass or a group of people.

4. Some suffixes consist of a single character, such as <-r>, the genitive case marker, and <-t>, the locative case marker. These create problems with various uninflected nouns and verbs. For example, the word mAnuhar ends with <-r> and is in the genitive case, whereas amar also ends with <-r>, but here <-r> is not a case marker; it is part of the stem.

5. In the case of some onomatopoeic words or reduplicated hyphenated words, both parts may take inflection to emphasize the word. For example:

dAngare-dAngare kAjiyA karise.
TF¹: elder+NOM-elder+NOM quarrel do+PPT
ET²: elders are quarrelling.

In this sentence the hyphenated token dAngar-dAngar is inflected with a nominative case marker (NOM) to emphasize "elder", which in effect implies a plural sense. In certain cases numbers, foreign words and symbols are also inflected, depending on use. In such situations the suffix stripping method fails.

6. A root can take more than one suffix in a sequence. Our study reveals that an Assamese noun root can have a maximum of 30,000 inflected forms in the worst case. So our aim is to extract the sequence of suffixes that a root can have. When a suffix sequence is added, some morphophonemic changes are made, based on the sound. For example, <kar> (kar: to do) is a verb root ending with a consonant; when a suffix string that starts with a full vowel (for example <owA>, the 2nd person causative verb marker) is added after the root kar, the new inflected form is karowA. The full vowel <-o> is changed to a vowel modifier. As the new form ends with a vowel sound <-aA>, any addition after that does not undergo morphophonemic change.

karowaiCil = kar + owA + iCil
           = do + 2nd person causative verb marker + past perfect tense marker.

Here the o of <owA> is a vowel modifier, while the i of <iCil> is a full vowel. Thus identifying this type of inflection is very tedious, or sometimes impossible, using the simple suffix stripping method.

¹ Transliterated form
² Approximate English translation
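The ambiguity described in items 1 and 4 can be made concrete with a small sketch. The Python below is an illustration written for this text, not the authors' code. It strips the longest matching suffix from the romanized example words above. As a simplifying assumption, the genitive marker <-r> is written in its romanized surface form `ar` (the paper romanizes mAnuh + r as mAnuhar), and the expected stems simply encode the paper's statements about which words carry a real suffix. The names `SUFFIXES`, `EXPECTED`, `naive_strip` and `over_stemmed` are hypothetical.

```python
# Illustrative only: suffixes are romanized surface forms from the paper's examples.
# "ar" stands in for the genitive marker <-r> (assumption: mAnuh + r -> mAnuhar).
SUFFIXES = ["bor", "mAn", "jan", "ar"]

# Expected stems, following the paper's statements about each word.
EXPECTED = {
    "mAnuhbor": "mAnuh",     # real plural marker <-bor>
    "mAnuhar": "mAnuh",      # real genitive marker <-r>
    "ketbor": "ketbor",      # <-bor> is not a suffix here
    "kisumAn": "kisumAn",    # <-mAn> is not a suffix here
    "prayojan": "prayojan",  # <-jan> is not a suffix here
    "amar": "amar",          # final <-r> belongs to the stem
}


def naive_strip(word, suffixes=SUFFIXES):
    """Strip the longest suffix that matches the end of the word."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)]
    return word


def over_stemmed(expected=EXPECTED):
    """Map each wrongly stemmed word to the stem the naive stripper produced."""
    return {
        word: naive_strip(word)
        for word, stem in expected.items()
        if naive_strip(word) != stem
    }


if __name__ == "__main__":
    for word, wrong in over_stemmed().items():
        print(f"{word}: got {wrong!r}, expected {EXPECTED[word]!r}")
```

In this toy setting four of the six words are over-stemmed, which is the kind of error a list of known root words is meant to catch.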

International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012) 843
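Item 6 of Section 3 states that the goal is to recover the sequence of suffixes a root can take. One way to enumerate such sequences is to combine suffix categories according to ordering templates. The sketch below is an illustration under explicit assumptions, not the paper's rule engine. The category labels (PL, CM, EM) and the three templates follow the noun examples given in Section 4.1, while `CATEGORIES`, `TEMPLATES` and `generate_suffix_sequences` are hypothetical names. The inventory is a toy, and suffixes are concatenated as romanized surface strings (with `ar` standing for <-r>), ignoring morphophonemic changes.

```python
from itertools import product

# Toy inventory of romanized surface forms (assumption, not the paper's full list).
CATEGORIES = {
    "PL": ["bor", "mAn"],  # plural markers
    "CM": ["ar"],          # case marker <-r>, written in its surface form
    "EM": ["he"],          # emphatic marker
}

# Ordering templates taken from the three noun examples in Section 4.1.
TEMPLATES = [
    ("PL",),
    ("CM",),
    ("PL", "CM", "EM"),
]


def generate_suffix_sequences(categories=CATEGORIES, templates=TEMPLATES):
    """Return every suffix string licensed by the templates, longest first."""
    sequences = set()
    for template in templates:
        choices = [categories[name] for name in template]
        for combo in product(*choices):
            sequences.add("".join(combo))
    # Longest first, so a later look-up can try the longest match before shorter ones.
    return sorted(sequences, key=lambda s: (-len(s), s))


if __name__ == "__main__":
    print(generate_suffix_sequences())
```

The generated strings are only candidates: the sketch does not check whether each combination is actually attested in Assamese, and a real rule set would need many more categories and ordering constraints.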

4. OUR APPROACH
In this study we use a part of the EMILLE³ corpus of size 123753 words. Table 1 provides the basic statistics of the dataset used, and Table 2 shows the most frequent words from the dataset along with their part-of-speech tags and frequencies. For this experiment we consider the suffix stripping approach to find stems in Assamese. We found that the suffix stripping approach can stem words with 61% accuracy. After that we added a root-word list of approximately 20,000 entries in the second approach.

³ http://www.emille.lancs.ac.uk/

Number of assertive sentences               7689
Number of interrogative sentences            218
Number of exclamatory sentences               93
Total number of words                     123753
Number of words in the longest sentence      112
Number of words in the shortest sentence       1
Unique words in the dataset                25111
Mean word length                            5.85

Table 1: Statistics of the test dataset

4.1 Approach-1
As mentioned above, an Assamese root can take a series of suffixes sequentially, so our first aim is to find the probable sequences of suffixes that a root can take. We manually collected all possible suffixes and categorised them into four basic groups, viz., case marker, plural marker, classifier and emphatic marker. After that the rule engine generates a list of suffixes. The rule engine is a module that generates all probable suffix sequences that may be attached after a root, based on the affixation rules of Assamese. For example, an Assamese noun root may take the following sequences of suffixes:

1. root + plural marker (PL)
   Example: mAnuhbor = mAnuh + bor
2. root + case marker (CM)
   Example: mAnuhar = mAnuh + r
3. root + plural marker + case marker + emphatic marker (EM)
   Example: mAnuhborarhe = mAnuh + bor + r + he
etc.

We confirmed 14 such rules and generated approximately 18,194 suffix sequences for Assamese. The next phase was to extract the longest possible sub-string from the given word using a suffix-list look-up. Among the generated suffixes, 11 were single word suffixes, and these caused a rapid fall in accuracy, as they create the problems mentioned in Section 3. The pseudo-code for this approach is given in Algorithm 1.

Algorithm 1 Stemming Approach-1
1: Read a line from the corpus file.
2: Extract the words (from this point on called tokens) from the line and clean each token, that is, remove any punctuation marker attached to it.
3: Look up the manually generated suffix list from the end of the token. If a suffix in the list matches, extract it and exit.
4: Go to step 1 until the end of the corpus.

After this experiment we found that 61% of the words are correctly stemmed. The error rate for inflected words with suffix length greater than 4 is less than 1%, whereas the error rate for inflected words with suffix length equal to 1 is the highest, 56%. It is clear from this experiment that the error rate decreases with increasing suffix length. As the error rate of single-character suffixes is the highest, one possible way to increase the accuracy is to add a root-word list, which is discussed in Section 4.2.

4.2 Approach-2
With the existing rule set or knowledge base we were not able to classify all the words of the language into well-defined categories. Each language has such words that cause trouble. To handle this type of exception we maintain a word list in which the most frequent and exceptional root words are stored. In this approach each word to be stemmed is first checked against the word list, and after that we apply Algorithm 1. The main advantage of the approach is that it minimizes over-stemming and under-stemming errors. The steps are given in Algorithm 2.

Algorithm 2 Stemming Approach-2
1: Read a line from the corpus file.
2: Extract the words (from this point on called tokens) from the line and clean each token, that is, remove any punctuation marker attached to it.
3: Check the dictionary. If a dictionary entry matches the token, mark the token as a root word and exit; otherwise execute the next step.
4: Look up the manually generated suffix list from the end of the token. If a suffix in the list matches, extract it and exit.
5: Go to step 1 until the end of the corpus.

5. EXPERIMENTAL RESULT
The accuracy found using approach 1 is 61%, and using approach 2 it is 82%. The complete statistics of our experiment are tabulated in Table 3 and Table 4. In comparison with [16, 9], the result produced by approach 2 with a root-word list look-up of only 20,000 entries is considerably better.

Approach-1
Correctly stemmed                         61%
Incorrectly stemmed                       39%
Stemmed as no inflection                  24%
Stemmed as single character inflection    56%
Stemmed as multiple inflection            20%

Table 3: Obtained result using approach-1

We evaluate the strength of approach 1 and approach 2 using [2]. Table 5 and Table 6 show the evaluation results of both approaches.
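The two procedures of Section 4 (Algorithms 1 and 2) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. `SUFFIXES` and `ROOT_WORDS` are tiny hypothetical stand-ins for the generated suffix list and the root-word list of roughly 20,000 entries. Suffixes are romanized surface forms (with `ar` standing for the case marker <-r>), tokens are split on whitespace, and the danda is added to the punctuation set as an assumption about the corpus.

```python
import string

# Hypothetical toy resources; the paper's lists are far larger.
SUFFIXES = ["borarhe", "bor", "ar"]
ROOT_WORDS = {"amar", "ketbor"}

# ASCII punctuation plus the danda (U+0964); the exact set is an assumption.
PUNCTUATION = string.punctuation + "\u0964"


def clean(token):
    """Remove punctuation attached to either end of a token."""
    return token.strip(PUNCTUATION)


def stem_approach1(token, suffixes=SUFFIXES):
    """Approach 1: strip the longest suffix found at the end of the token."""
    token = clean(token)
    for suffix in sorted(suffixes, key=len, reverse=True):
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def stem_approach2(token, roots=ROOT_WORDS, suffixes=SUFFIXES):
    """Approach 2: consult the root-word list first, then fall back to approach 1."""
    token = clean(token)
    if token in roots:
        return token
    return stem_approach1(token, suffixes)


def stem_lines(lines, stemmer):
    """Apply a stemmer to every non-empty token of every line."""
    for line in lines:
        yield [stemmer(tok) for tok in line.split() if clean(tok)]


if __name__ == "__main__":
    sample = ["mAnuhborarhe amar ketbor ."]
    print(list(stem_lines(sample, stem_approach1)))
    print(list(stem_lines(sample, stem_approach2)))
```

The only difference between the two functions is the early return on a root-word match, which is what prevents words such as amar from being over-stemmed in this toy example.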


Sl.  Assamese word  Approximate translation      POS tag        Frequency  Inflected?
1    aAru           and                          Particle       2249       No
2    mai            me                           Pronoun        1105       No
3    ai             this/it                      Demonstrative  1087       No
4    mor            me + genitive marker         Pronoun        1063       Yes
5    teon           he                           Pronoun         765       No
6    aAsil          be + past tense              Verb            751       Yes
7    teonr          he + genitive case marker    Pronoun         641       Yes
8    kari           do + present participle      Verb            640       Yes
9    cei            that                         Demonstrative   639       No
10   kintu          but                          Particle        583       No

Table 2: The 10 most frequent words in the dataset
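The index compression factor used in the evaluation follows Frakes and Fox [2]. As commonly stated, it is the fraction by which stemming reduces the number of distinct words, ICF = (N − S)/N, where N is the number of unique words before stemming and S the number after. The short sketch below, written for illustration with a hypothetical function name, recomputes the factor from the unique-word counts reported for this dataset (25,111 before stemming; 18,012 and 17,218 after approaches 1 and 2).

```python
def index_compression_factor(unique_before, unique_after):
    """Fractional reduction in distinct words achieved by stemming: (N - S) / N."""
    if unique_before <= 0:
        raise ValueError("unique_before must be positive")
    return (unique_before - unique_after) / unique_before


if __name__ == "__main__":
    # Unique-word counts as reported in the paper's evaluation tables.
    icf_approach1 = index_compression_factor(25111, 18012)
    icf_approach2 = index_compression_factor(25111, 17218)
    print(round(icf_approach1, 2), round(icf_approach2, 2))
```

Rounded to two decimals, these counts give 0.28 and 0.31, matching the reported values.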

The index compression factor for approach 1 is 0.28, whereas the index compression factor for approach 2 is 0.31.

Approach-2
Correctly stemmed                 82%
Incorrectly stemmed               18%
Stemmed as no inflection          23%
Stemmed as single inflection      57%
Stemmed as multiple inflection    20%

Table 4: Obtained result using approach-2

Approach-1
Words in the test file              123753
Unique words before stemming         25111
Unique words after stemming          18012
Min. word length after stemming          1
Max. word length after stemming         14
Mean number of words                  1.11
Mean stemmed word length              4.03
Index compression factor              0.28

Table 5: Evaluation of approach-1 using [2]

Approach-2
Words in the test file              123753
Unique words before stemming         25111
Unique words after stemming          17218
Min. word length after stemming          1
Max. word length after stemming         14
Mean number of words                  1.36
Mean stemmed word length              4.55
Index compression factor              0.31

Table 6: Evaluation of approach-2 using [2]

6. CONCLUSION
In this report we analyse and evaluate two stemming techniques with a dataset of 123753 words. In the first approach we use suffix stripping with the suffixes generated by the rule engine; there is always a problem of over-stemming and under-stemming with this approach. In the second approach we combine a list of frequent root words with suffix stripping, which increases the accuracy from 61% to 82%. In comparison with [16, 9], the result produced by approach 2 with a root-word list look-up of only 20,000 entries is considerable.

7. REFERENCES
[1] N. Aswani and R. Gaizauskas. Developing morphological analysers for South Asian languages: Experimenting with the Hindi and Gujarati languages. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC), 2010.
[2] W. B. Frakes and C. J. Fox. Strength and similarity of affix removal stemming algorithms. SIGIR Forum, 37(1):26–30, Apr. 2003.
[3] J. Goldsmith. An Algorithm for the Unsupervised Learning of Morphology. Natural Language Engineering, 1, 1998.
[4] D. Kumar and P. Rana. Design and development of a stemmer for Punjabi. International Journal of Computer Applications, 11, 2011.
[5] J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 1968.
[6] M. M. Majgaonker and T. J. Siddiqui. Discovering suffixes: A case study for Marathi language. International Journal on Computer Science and Engineering, 04:2716–2720, 2010.
[7] P. Majumder, M. Mitra, S. Parui, G. Kole, P. Mitra, and K. Datta. YASS: Yet another suffix stripper. ACM Transactions on Information Systems, 25, 2007.
[8] A. Pandey and T. Siddiqui. An unsupervised Hindi stemmer with heuristic improvements. In Proceedings of the second workshop on Analytics for noisy unstructured text data, pages 99–105, 2008.
[9] M. Parakh and R. N. Developing morphology analyser four Indian languages using a rule based suffix stripping approach. Language In India, pages 13–15, 2011.
[10] M. F. Porter. An algorithm for suffix stripping. Program, 14:130–137, 1980.
[11] V. S. Ram and S. L. Devi. Malayalam stemmer. In Morphological Analysers and Generators, pages 105–113, 2010.
[12] A. Ramanathan and D. D. Rao. A lightweight stemmer for Hindi. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), on Computational Linguistics for South Asian Languages, 2003.
[13] N. Saharia, D. Das, U. Sharma, and J. Kalita. Part of speech tagger for Assamese text. In Proceedings of the ACL-IJCNLP 2009 Conference Short Papers, pages 33–36, Suntec, Singapore, 2009.
[14] N. Saharia, U. Sharma, and J. Kalita. A suffix-based noun and verb classifier for an inflectional language. In International Conference on Asian Language Processing, pages 19–22. IEEE Computer Society, 2010.
[15] U. Sharma, J. Kalita, and R. Das. Root word stemming by multiple evidence from corpus. In Proceedings of 6th International Conference on Computational Intelligence and Natural Computing (CINC-2003), 2003.
[16] U. Sharma, J. K. Kalita, and R. K. Das. Acquisition of Morphology of an Indic Language from Text Corpus. ACM Transactions on Asian Language Information Processing (TALIP), 7, 2008.
