Analysis and Evaluation of Stemming Algorithms: A Case Study with Assamese

Navanath Saharia∗, Department of CSE, Tezpur University, Napaam, India-784028
Utpal Sharma†, Department of CSE, Tezpur University, Napaam, India-784028
Jugal Kalita‡, Department of CS, University of Colorado, Colorado Springs, USA-80918

∗Navanath Saharia is currently pursuing his Ph.D. in Computer Science and
Engineering at Tezpur University, India.
†Utpal Sharma is an Associate Professor in the Department of Computer
Science and Engineering, Tezpur University, India.
‡Jugal Kalita is a Professor in the Department of Computer Science,
University of Colorado at Colorado Springs, USA.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise,
or republish, to post on servers or to redistribute to lists, requires prior
specific permission and/or a fee.
ICACCI'12, August 3-5, 2012, Chennai, T Nadu, India.
Copyright 2012 ACM 978-1-4503-1196-0/12/08…$10.00.

ABSTRACT
Stemming is the process of automatically extracting the base form of a given
word of a language. Assamese is a morphologically rich, relatively free word
order, Indo-Aryan language spoken in the north-eastern part of India and
written in the Assamese-Bengali script. As it is among the less
computationally studied languages, our aim is to extract the stem from a
given word. We adopt a suffix stripping approach along with a rule engine
that generates all possible suffix sequences. We obtain 82% accuracy with the
suffix stripping approach after adding a root-word list of approximately
20,000 entries.

Keywords
Stemming, Inflectional language, Assamese

1. INTRODUCTION
In any information retrieval (IR) system, one of the first tasks is to
extract and index the information available in the document in the form of
words or terms. As most Indian languages are highly inflectional (and as a
result morphologically rich), they are adversely affected by the abundance of
words appearing in various morphological forms. Indexing all the words is
tedious as well as uninformative for further processing. To reduce the
morphological variation in indexing, the most common method is to represent
the words in a normalized representative form. One such approach is the
process of finding the stem or root. Though linguistically root and stem have
different meanings, in this report we use both words interchangeably to mean
the stem of an inflected form. Stemming is one of the initial steps in
analyzing the morphology of words. A number of approaches have been reported
by researchers for different language groups. The reported approaches can be
classified into two broad categories: rule based techniques [10, 14] and
unsupervised techniques [7, 16]. The rule based approach produces the best
results for languages with relatively fixed word order or little inflection.
For highly inflectional languages, however, we need more complete linguistic
knowledge of the formation of words to handle more complex derivational and
inflectional morphology. To extract the root from an inflected word, we
should have detailed knowledge of word formation in that language. For highly
inflectional languages, some probable ways of forming words are listed below.

1. Combining more than one root word.
2. Partially combining more than one root word.
3. Adding tense, aspect or mood markers.
4. Word-to-word translation from another language.
5. Partially combining more than one root word from two languages.

After passing through any of the steps mentioned above, a root word has
changed in terms of morphology, syntax or semantics. An unsupervised approach
needs linguistic resources, including a corpus. Thus an unsupervised approach
may not produce better results for resource-poor languages such as most
Indian languages, including Assamese. A number of published works combine
both of the above approaches; we refer to these as hybrid.

Assamese is one of the major languages in the north-eastern part of India,
with approximately 30 million native speakers. In this study, we consider the
problem of stemming Assamese texts, in which stemming is particularly hard
due to the common appearance of single letter suffixes as morphological
inflections. In our study we observed that more than 50% of the inflections
in the Assamese language appear as single letter suffixes. Such single letter
morphological inflections cause ambiguity while predicting the underlying
root word.

The rest of the paper is organized as follows. In Section 2 we describe
previous work related to stemming, followed
by linguistic characteristics of Assamese. Section 4 describes our approach
and Section 5 gives the results and analysis of the approach. Finally,
Section 6 concludes the paper.

2. RELATED WORK
Among the reported work, the Porter stemmer [10], an iterative rule based
approach, is the most successful and is widely used in applications such as
spell-checking and morphological analysis. Lovins' approach [5] and minimum
description length [3] are among the other popular stemming algorithms.

In the Indian language context a small number of hand-crafted rule based
stemmers have been reported, whose primary idea is to strip off suffixes.
Among these, [12] used a hand-crafted suffix list and stripped off the
longest suffix for Hindi, reporting 88% accuracy with a dictionary of size
35,997. The work reported in [7] learned suffix stripping rules from a corpus
and used clustering to discover the nearest class of the root word for
Bengali, English and French. They described a centroid based approach that
rewards the longest common prefix to form similar word clusters based on a
threshold value. The work of [8] focused on some heuristic rules for Hindi
and reported 89% accuracy. [1] proposed a hybrid form of [7] and [8] for
Hindi and Gujarati with precisions of 78% and 83%, respectively. Their
approach took both prefixes and suffixes into account. They used a dictionary
and suffix replacement rules and claimed that the approach is portable and
fast.

The work of [4] for Punjabi used a dictionary of size 52,000 and found 81.27%
accuracy using a brute-force approach. [6] described a hybrid method (rule
based + suffix stripping + statistical) for Marathi and found 82.50%
precision for their system. Work on Malayalam [11] used a dictionary of size
3,000 and reported a 90.5% accurate system using finite state machines. We
have found only [16, 9] as examples of the state of the art in Assamese
stemming, where [16] uses an unsupervised approach to acquire the
language-specific stemming rules.

3. LANGUAGE RELATED ISSUES
Assamese is a morphologically rich Indic language spoken in the north-eastern
part of India. Among the reported work on Assamese, [15, 16, 13, 14] are the
main contributions. As in most Indic languages, in Assamese only prefixes and
suffixes are attached to the start and end of a word. Though this seems
simple, there are certain characteristics that complicate the Assamese
stemming process.

1. A sub-string may not always act as a suffix in Assamese word
   construction. For example, <-bor> and <-mAn> denote the indefinite plural
   marker, but in the case of ketbor and kisumAn we cannot extract <-bor> and
   <-mAn> as suffixes. Likewise, <-jan> is used to mark a singular, masculine,
   animate noun. This is not the case with the word prayojan, where we cannot
   extract <-jan> as a suffix denoting a singular, masculine, animate noun.

2. The corpus we process has a number of misspelled words, and sometimes
   suffixes are written separately from the root word. For example, we found
   the suffix <-samUh> occurring more than 500 times as a single word in a
   corpus of size 2.6 million, which is grammatically wrong. There are a
   number of other irregularities, specifically in the case of hyphenated
   words.

3. Sometimes a whole word can appear as a suffix, or equivalently a suffix
   can appear as a separate word. For example, <-dal> is used as an
   indefinite plural marker for animate nouns, whereas <dal> as a separate
   word means a kind of aquatic grass or a group of people.

4. Some suffixes consist of a single character, such as <-r>, the genitive
   case marker, and <-t>, the locative case marker. These create problems
   with various non-inflected nouns and verbs. For example, the word mAnuhar
   ends with <-r> and is in the genitive case, whereas amar also ends with
   <-r>, but here <-r> is not a case marker; it is part of the stem.

5. In the case of some onomatopoeic words or reduplicative hyphenated words,
   both parts may take inflection to emphasize the word. For example:

       dAngare-dAngare kAjiyA karise.
       TF (transliterated form): elder+NOM-elder+NOM quarrel do+PPT
       ET (approximate English translation): Elders are quarrelling.

   In this sentence the hyphenated token, i.e., dAngar-dAngar, is inflected
   with a nominative case marker (NOM) to emphasize "elder", which in effect
   implies a plural sense. In certain cases numbers, foreign words and
   symbols are also inflected depending on use. In such situations the suffix
   stripping method fails.

6. A root can take more than one suffix in a sequence. Our study reveals that
   an Assamese noun root can have a maximum of 30,000 inflected forms in the
   worst case. So our aim is to extract the sequence of suffixes that a root
   can have. During addition of a suffix sequence, some morphophonemic
   changes are made based on the sound. For example, <kar> (kar: to do) is a
   verb root ending with a consonant. When a suffix string that starts with a
   full vowel (for example <owA>, the 2nd person causative verb marker) is
   added after the root kar, the new inflected form is karowA: the full vowel
   <-o> is changed to a vowel modifier. As the new form ends with a vowel
   sound <-aA>, any addition after that does not undergo morphophonemic
   change.

       karowaiCil = kar + owA + iCil
                  = do + 2nd person causative verb marker
                    + past perfect tense marker

   Here the o (from owA) is a vowel modifier and the i (from iCil) is a full
   vowel. Thus identifying this type of inflection is very tedious, or
   sometimes impossible, using the simple suffix stripping method.

International Conference on Advances in Computing, Communications and Informatics (ICACCI-2012) 843


4. OUR APPROACH
In this study we use a part of the EMILLE corpus
(http://www.emille.lancs.ac.uk/) of 123753 words. Table 1 provides the basic
statistics of the dataset used, and Table 2 shows the most frequent words
along with their part of speech tags and frequencies in the dataset. For this
experiment we consider the suffix stripping approach to find stems in
Assamese. We found that the suffix stripping approach stems words with 61%
accuracy. We then added a root-word list of approximately 20,000 entries in
the second approach.

Number of assertive sentences               7689
Number of interrogative sentences           218
Number of exclamatory sentences             93
Total number of words                       123753
Number of words in the longest sentence     112
Number of words in the shortest sentence    1
Unique words in the dataset                 25111
Mean word length                            5.85

Table 1: Statistics of the test dataset

Sl.  Assamese word  Approximate translation    POS tag        Frequency  Inflected?
1    aAru           and                        Particle       2249       No
2    mai            me                         Pronoun        1105       No
3    ai             this/it                    Demonstrative  1087       No
4    mor            me + genitive marker       Pronoun        1063       Yes
5    teon           he                         Pronoun        765        No
6    aAsil          be + past tense            Verb           751        Yes
7    teonr          he + genitive case marker  Pronoun        641        Yes
8    kari           do + present participle    Verb           640        Yes
9    cei            that                       Demonstrative  639        No
10   kintu          but                        Particle       583        No

Table 2: The 10 most frequent words in the dataset

4.1 Approach-1
As mentioned above, an Assamese root can take a series of suffixes
sequentially, so our first aim is to find the probable sequences of suffixes
that a root can take. We manually collected all possible suffixes and
categorised them into four basic groups, viz., case marker, plural marker,
classifier and emphatic marker. The rule engine then generates a list of
suffixes. The rule engine is a module that generates all probable suffix
sequences that may be attached after a root, based on the affixation rules of
Assamese. For example, an Assamese noun root may take the following sequences
of suffixes:

1. root + plural marker (PL)
   Example: mAnuhbor = mAnuh + bor
2. root + case marker (CM)
   Example: mAnuhar = mAnuh + r
3. root + plural marker + case marker + emphatic marker (EM)
   Example: mAnuhborarhe = mAnuh + bor + r + he

and so on.

We confirmed 14 such rules and generated approximately 18,194 suffix
sequences for Assamese. The next phase was to extract the longest possible
sub-string from the given word using a suffix-list look-up. Among the
generated suffixes, 11 were single word suffixes, and these caused a rapid
fall in accuracy, as they created the problems mentioned in Section 3. The
pseudo-code for this approach is given in Algorithm 1.

Algorithm 1 Stemming Approach-1
1: Read a line from the corpus file.
2: Extract the words (from this point on called tokens) from the line and
   clean each token, that is, remove any punctuation mark attached to it.
3: Look up the manually generated suffix list from the end of the token. If
   an entry of the suffix list matches, extract the stem and exit.
4: Go to step 1 until the end of the corpus.

After this experiment we found that 61% of the words are correctly stemmed.
We observed that the error rate for inflected words with suffix length
greater than 4 is less than 1%, whereas the error rate for inflected words
with suffix length equal to 1 is the highest, 56%. It is clear from this
experiment that the error rate decreases with increasing suffix length. As
the error rate of single character suffixes is the highest, one possible
solution to increase the accuracy is to add a root-word list, which is
discussed in Section 4.2.

4.2 Approach-2
With the existing rule set or knowledge base we were not able to classify all
the words of the language into well-defined categories. Each language has
such words that cause trouble. To handle this type of exception we maintain a
word list in which the most frequent and exceptional root words are stored.
Thus, in this approach, each word to be stemmed is first checked against the
word list, and after that we apply Algorithm 1. The main advantage of the
approach is that it minimizes over-stemming and under-stemming errors. The
steps are given in Algorithm 2.

Algorithm 2 Stemming Approach-2
1: Read a line from the corpus file.
2: Extract the words (from this point on called tokens) from the line and
   clean each token, that is, remove any punctuation mark attached to it.
3: Check the dictionary. If a dictionary entry matches the token, mark the
   token as a root word and exit; otherwise execute the next step.
4: Look up the manually generated suffix list from the end of the token. If
   an entry of the suffix list matches, extract the stem and exit.
5: Go to step 1 until the end of the corpus.

5. EXPERIMENTAL RESULT
The accuracy found using approach 1 is 61%, and using approach 2 it is 82%.
The complete statistics of our experiment are tabulated in Table 3 and
Table 4. In comparison with [16, 9], the result produced by approach 2 with a
look-up list of only 20,000 root words is considerably better.

Approach-1
Correctly stemmed                          61%
Incorrectly stemmed                        39%
Stemmed as no inflection                   24%
Stemmed as single character inflection     56%
Stemmed as multiple inflection             20%

Table 3: Obtained result using approach-1

Approach-2
Correctly stemmed                          82%
Incorrectly stemmed                        18%
Stemmed as no inflection                   23%
Stemmed as single inflection               57%
Stemmed as multiple inflection             20%

Table 4: Obtained result using approach-2

We evaluate the strength of approach 1 and approach 2 using [2]. Table 5 and
Table 6 show the evaluation results of both approaches. The index compression
factor for approach 1 is 0.28, whereas the index compression factor for
approach 2 is 0.31.

Approach-1
Words in the test file                     123753
Unique words before stemming               25111
Unique words after stemming                18012
Min. word length after stemming            1
Max. word length after stemming            14
Mean number of words                       1.11
Mean stemmed word length                   4.03
Index compression factor                   0.28

Table 5: Evaluation of approach-1 using [2]

Approach-2
Words in the test file                     123753
Unique words before stemming               25111
Unique words after stemming                17218
Min. word length after stemming            1
Max. word length after stemming            14
Mean number of words                       1.36
Mean stemmed word length                   4.55
Index compression factor                   0.31

Table 6: Evaluation of approach-2 using [2]

6. CONCLUSION
In this report we analyse and evaluate two stemming techniques on a dataset
of 123753 words. In the first approach we use suffix stripping with the
suffixes generated by the rule engine; there is always a problem of
over-stemming and under-stemming with this approach. In the second approach
we combine a list of frequent root words with suffix stripping, which
increases the accuracy from 61% to 82%. In comparison with [16, 9], the
result produced by approach 2 with a look-up list of only 20,000 root words
is considerable.

7. REFERENCES
[1] N. Aswani and R. Gaizauskas. Developing morphological analysers for
    south asian languages: Experimenting with the Hindi and Gujarati
    languages. In Proceedings of the Seventh conference on International
    Language Resources and Evaluation (LREC), 2010.
[2] W. B. Frakes and C. J. Fox. Strength and similarity of affix removal
    stemming algorithms. SIGIR Forum, 37(1):26–30, Apr. 2003.
[3] J. Goldsmith. An Algorithm for the Unsupervised Learning of Morphology.
    Natural Language Engineering, 1, 1998.
[4] D. Kumar and P. Rana. Design and development of a stemmer for Punjabi.
    International Journal of Computer Applications, 11, 2011.
[5] J. B. Lovins. Development of a stemming algorithm. Mechanical
    Translation and Computational Linguistics, 11, 1968.
[6] M. M. Majgaonker and T. J. Siddiqui. Discovering suffixes: A case study
    for Marathi language. International Journal on Computer Science and
    Engineering, 04:2716–2720, 2010.
[7] P. Majumder, M. Mitra, S. Parui, G. Kole, P. Mitra, and K. Datta. YASS:
    Yet another suffix stripper. ACM Transactions on Information Systems,
    25, 2007.
[8] A. Pandey and T. Siddiqui. An unsupervised Hindi stemmer with heuristic
    improvements. In Proceedings of the second workshop on Analytics for
    noisy unstructured text data, pages 99–105, 2008.
[9] M. Parakh and R. N. Developing morphology analyser for four Indian
    languages using a rule based suffix stripping approach. Language In
    India, pages 13–15, 2011.
[10] M. F. Porter. An algorithm for suffix stripping. Program, 14:130–137,
    1980.
[11] V. S. Ram and S. L. Devi. Malayalam stemmer. In Morphological Analysers
    and Generators, pages 105–113, 2010.
[12] A. Ramanathan and D. D. Rao. A lightweight stemmer for Hindi. In
    Proceedings of the 10th Conference of the European Chapter of the
    Association for Computational Linguistics (EACL), on Computational
    Linguistics for South Asian Languages, 2003.
[13] N. Saharia, D. Das, U. Sharma, and J. Kalita. Part of speech tagger for
    Assamese text. In Proceedings of the
    ACL-IJCNLP 2009 Conference Short Papers, pages 33–36, Suntec, Singapore,
    2009.
[14] N. Saharia, U. Sharma, and J. Kalita. A suffix-based noun and verb
    classifier for an inflectional language. In International Conference on
    Asian Language Processing, pages 19–22. IEEE Computer Society, 2010.
[15] U. Sharma, J. Kalita, and R. Das. Root word stemming by multiple
    evidence from corpus. In Proceedings of 6th International Conference on
    Computational Intelligence and Natural Computing (CINC-2003), 2003.
[16] U. Sharma, J. K. Kalita, and R. K. Das. Acquisition of Morphology of an
    Indic Language from Text Corpus. ACM Transactions on Asian Language
    Information Processing (TALIP), 7, 2008.
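
As a supplementary illustration of Section 4, the rule engine and the two look-up procedures (Algorithms 1 and 2) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the system evaluated above. The marker inventory is a tiny hypothetical subset written as romanized surface forms (for instance "ar" stands for the case marker <-r> as it surfaces in mAnuhar), only the three example rules from Section 4.1 are encoded rather than the 14 confirmed rules, and every function name is introduced here for illustration.

```python
import string
from itertools import product

# Hypothetical, heavily reduced marker inventory (romanized surface forms).
MARKERS = {
    "PL": ["bor", "mAn"],  # plural markers
    "CM": ["ar"],          # case marker <-r> as it surfaces in mAnuhar
    "EM": ["he"],          # emphatic marker
}

# The three example rules from Section 4.1, each a sequence of categories.
RULES = [("PL",), ("CM",), ("PL", "CM", "EM")]


def generate_suffix_sequences(rules=RULES, markers=MARKERS):
    """Rule engine: expand every rule into concrete suffix strings."""
    sequences = set()
    for rule in rules:
        for combo in product(*(markers[category] for category in rule)):
            sequences.add("".join(combo))
    # Longest first, so that look-up prefers the longest matching sequence.
    return sorted(sequences, key=lambda s: (-len(s), s))


def clean_tokens(line):
    """Split a line into tokens and remove attached ASCII punctuation."""
    tokens = (word.strip(string.punctuation) for word in line.split())
    return [token for token in tokens if token]


def stem_approach_1(token, suffixes):
    """Algorithm 1: strip the longest suffix sequence found at the token end."""
    for suffix in suffixes:
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)]
    return token


def stem_approach_2(token, root_words, suffixes):
    """Algorithm 2: consult the root-word list first, then fall back."""
    if token in root_words:
        return token
    return stem_approach_1(token, suffixes)


if __name__ == "__main__":
    suffixes = generate_suffix_sequences()
    roots = {"amar", "ketbor"}  # stand-in for the 20,000-entry root-word list
    for token in clean_tokens("mAnuhborarhe, mAnuhar amar ketbor."):
        print(token, stem_approach_1(token, suffixes),
              stem_approach_2(token, roots, suffixes))
```

With this toy inventory, approach 1 reduces mAnuhborarhe and mAnuhar to mAnuh but also over-stems amar and ketbor, while approach 2 leaves the latter two intact because they are found in the root-word list. This mirrors the motivation given in Section 4.2 for checking the word list before stripping.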

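
The index compression factors in Tables 5 and 6 can be reproduced from the unique-word counts in the same tables. The sketch below assumes the definition from [2], namely (N - S) / N, where N is the number of unique words before stemming and S is the number of unique stems after stemming; the function name is a hypothetical helper introduced for illustration.

```python
def index_compression_factor(unique_words, unique_stems):
    """Fraction by which stemming shrinks the set of distinct index terms."""
    if unique_words <= 0:
        raise ValueError("unique_words must be positive")
    return (unique_words - unique_stems) / unique_words


# Counts taken from Table 5 (approach 1) and Table 6 (approach 2).
icf_approach_1 = index_compression_factor(25111, 18012)
icf_approach_2 = index_compression_factor(25111, 17218)

print(f"{icf_approach_1:.2f} {icf_approach_2:.2f}")  # prints "0.28 0.31"
```

Both values agree with the reported factors, which indicates that approach 2 conflates slightly more surface forms into shared stems than approach 1.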