Analysis and Evaluation of Stemming Algorithms: A Case Study with Assamese
Navanath Saharia, Department of CSE, Tezpur University, Napaam, India-784028
Utpal Sharma, Department of CSE, Tezpur University, Napaam, India-784028
Jugal Kalita, Department of CS, University of Colorado, Colorado Springs, USA-80918
by linguistic characteristics of Assamese. Next, Section 4 describes our approach and Section 5 gives the results and an analysis of the approach. Finally, Section 6 concludes the paper.

2. RELATED WORK
Among the reported work, the Porter stemmer [10], an iterative rule-based approach, is the most successful and is widely used in applications such as spell checking and morphological analysis. Lovins' approach [5] and the minimum description length approach [3] are among the other popular stemming algorithms.

In the Indian language context, a small number of hand-crafted rule-based stemmers have been reported, whose primary idea is to strip off suffixes. Among these, [12] used a hand-crafted suffix list and stripped off the longest suffix for Hindi, reporting 88% accuracy with a dictionary of size 35,997. The work reported in [7] learned suffix stripping rules from a corpus and used clustering to discover the nearest class of the root word for Bengali, English and French. They described a centroid-based approach that rewards the longest common prefix to form clusters of similar words based on a threshold value. The work of [8] focused on heuristic rules for Hindi and reported 89% accuracy. [1] proposed a hybrid of [7] and [8] for Hindi and Gujarati, with precisions of 78% and 83%, respectively. Their approach took both prefixes and suffixes into account. They used a dictionary and suffix replacement rules and claimed that the approach is portable and fast.

The work of [4] for Punjabi used a dictionary of size 52,000 and obtained 81.27% accuracy using a brute-force approach. [6] described a hybrid method (rule based + suffix stripping + statistical) for Marathi and obtained 82.50% precision for their system. Work on Malayalam [11] used a dictionary of size 3,000 and reported a 90.5% accurate system using finite state machines. We have found only [16, 9] as examples of the state of the art in Assamese stemming, where [16] uses an unsupervised approach to acquire the language-specific stemming rules.

3. LANGUAGE RELATED ISSUES
Assamese is a morphologically rich Indic language spoken in the north-eastern part of India. Among the reported work on Assamese, [15, 16, 13, 14] are the main contributions. As in most Indic languages, in Assamese only prefixes and suffixes are attached, at the beginning and end of a word respectively. Though this seems simple, there are certain characteristics that complicate the Assamese stemming process.

separately from the root word. For example, we found the suffix <-samUh> occurring more than 500 times as a single word in a corpus of size 2.6 million, which is grammatically wrong. There are a number of other irregularities, specifically in the case of hyphenated words.

3. Sometimes a whole word can also serve as a suffix; equivalently, a suffix can appear as a separate word. For example, <-dal> is the indefinite plural marker for animate nouns, whereas <dal> as a separate word means a kind of aquatic grass or a group of people.

4. There are some single-character suffixes, such as <-r>, the genitive case marker, and <-t>, the locative case marker. These create problems with different types of non-inflected nouns and verbs. For example, the word mAnuhar ends with <-r> and is in the genitive case, whereas amar also ends with <-r>, but here <-r> is not a case marker; it is part of the stem.

5. In the case of some onomatopoeic words or reduplicative hyphenated words, both parts may take inflection to emphasize the word. For example:

dAngare-dAngare kAjiyA karise.
TF1 : elder+NOM-elder+NOM quarrel do+PPT
ET2 : elders are quarrelling.

In this sentence the hyphenated token, i.e., dAngar-dAngar, is inflected with a nominative case marker (NOM) on both parts to emphasize "elder", which implies a plural sense. In certain cases numbers, foreign words and symbols are also inflected, depending on use. In such situations the suffix stripping method fails.

6. A root can take more than one suffix in a sequence. Our study reveals that an Assamese noun root can have a maximum of 30,000 inflected forms in the worst case. So our aim is to extract the sequences of suffixes that a root can take. When a suffix sequence is added, some morphophonemic changes are made, depending on the sound. For example, <kar> (kar: to do) is a verb root ending with a consonant. When a suffix string that starts with a full vowel (for example <owA>, the 2nd person causative verb marker) is added after the root kar, the new inflected form is karowA: the full vowel <-o> is changed to a vowel modifier. As the new form ends with a vowel sound <-aA>, any addition after that does not undergo morphophonemic change.
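
The single-character suffix problem in item 4 and the suffix sequences in item 6 can be made concrete with a minimal Python sketch. It is not the stemmer evaluated in this paper: the suffix inventory, the root lexicon, the function name strip_suffixes and the min_stem parameter are all hypothetical, chosen only to illustrate the issue on the romanized examples given above. The sketch peels suffixes from the right, longest match first, and stops as soon as the remaining string is a known root.

```python
# Hypothetical suffix inventory (romanized); a real list would be far larger.
SUFFIXES = ("samUh", "dal", "r", "t")

# Hypothetical root lexicon: words that must never be stripped further.
ROOTS = frozenset({"amar"})


def strip_suffixes(word, suffixes=SUFFIXES, roots=ROOTS, min_stem=2):
    """Peel suffixes from the right, longest match first.

    Returns (stem, removed), where removed lists the stripped
    suffixes in the order they appear in the surface word.
    min_stem is an illustrative guard against stripping a word
    down to fewer than min_stem characters.
    """
    by_length = sorted(suffixes, key=len, reverse=True)
    removed = []
    while word not in roots:
        match = next(
            (s for s in by_length
             if word.endswith(s) and len(word) - len(s) >= min_stem),
            None,
        )
        if match is None:
            break
        word = word[: -len(match)]
        removed.insert(0, match)
    return word, removed


if __name__ == "__main__":
    print(strip_suffixes("mAnuhar"))                  # genitive <-r> is removed
    print(strip_suffixes("amar"))                     # protected by the root lexicon
    print(strip_suffixes("amar", roots=frozenset()))  # over-stemmed without it
```

Without the root lexicon the final <-r> of amar is removed as if it were a case marker, which is the over-stemming described in item 4. The sketch works on romanized strings only and does not model the morphophonemic changes described in item 6.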
Approach-2
Correctly stemmed                  82%
Incorrectly stemmed                18%
Stemmed as no inflection           23%
Stemmed as single inflection       57%
Stemmed as multiple inflection     20%

Table 4: Obtained result using approach-2

1 is 0.28 whereas index compression factor for approach 2 is 0.31.

Approach-1
Words in the test file             123753
Unique words before stemming       25111
Unique words after stemming        18012
Min. word length after stemming    1
Max. word length after stemming    14
Mean number of words               1.11
Mean stemmed word length           4.03
Index compression factor           0.28

Table 5: Evaluation of approach-1 using [2]

Approach-2
Words in the test file             123753
Unique words before stemming       25111
Unique words after stemming        17218
Min. word length after stemming    1
Max. word length after stemming    14
Mean number of words               1.36
Mean stemmed word length           4.55
Index compression factor           0.31

Table 6: Evaluation of approach-2 using [2]

6. CONCLUSION
In this report we analyse and evaluate two stemming techniques with a dataset of 123753 words. In the first approach we describe suffix stripping with the suffixes generated by the rule engine; there is always a problem of over-stemming and under-stemming with this approach. In the second approach we use a frequent root-word list together with suffix stripping, which increases the accuracy from 61% to 82%. In comparison with [16, 9], the result produced by approach 2 with only a 20,000 root-word list look-up is considerable.

7. REFERENCES
[1] N. Aswani and R. Gaizauskas. Developing morphological analysers for south asian languages: Experimenting with the Hindi and Gujarati languages. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC), 2010.
[2] W. B. Frakes and C. J. Fox. Strength and similarity of affix removal stemming algorithms. SIGIR Forum, 37(1):26–30, Apr. 2003.
[3] J. Goldsmith. An Algorithm for the Unsupervised Learning of Morphology. Natural Language Engineering, 1, 1998.
[4] D. Kumar and P. Rana. Design and development of a stemmer for Punjabi. International Journal of Computer Applications, 11, 2011.
[5] J. B. Lovins. Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 1968.
[6] M. M. Majgaonker and T. J. Siddiqui. Discovering suffixes: A case study for Marathi language. International Journal on Computer Science and Engineering, 04:2716–2720, 2010.
[7] P. Majumder, M. Mitra, S. Parui, G. Kole, P. Mitra, and K. Datta. YASS: Yet another suffix stripper. ACM Transactions on Information Systems, 25, 2007.
[8] A. Pandey and T. Siddiqui. An unsupervised Hindi stemmer with heuristic improvements. In Proceedings of the second workshop on Analytics for noisy unstructured text data, pages 99–105, 2008.
[9] M. Parakh and R. N. Developing morphology analyser four Indian languages using a rule based suffix stripping approach. Language In India, pages 13–15, 2011.
[10] M. F. Porter. An algorithm for suffix stripping. Program, 14:130–137, 1980.
[11] V. S. Ram and S. L. Devi. Malayalam stemmer. In Morphological Analysers and Generators, pages 105–113, 2010.
[12] A. Ramanathan and D. D. Rao. A lightweight stemmer for Hindi. In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL), on Computational Linguistics for South Asian Languages, 2003.
[13] N. Saharia, D. Das, U. Sharma, and J. Kalita. Part of speech tagger for Assamese text. In Proceedings of the