Development of Longest-match based Stemmer for Wolaytta Language

Girma Yohannis Bade

URI: http://hdl.handle.net/123456789/293

Date: 2014-10

Abstract:

Stemming algorithms are well known in Natural Language Processing (NLP) and have a positive impact on Information Retrieval (IR) systems and morphological analysis. This research presents the design, development, and experimental evaluation of a longest-match based stemmer for Wolaytta texts. Wolaytta is a morphologically rich language. In Wolaytta, affixation and compounding are the two ways of forming words, and words formed in either way can be both derived and inflected. Because morphemes that are realized as prefixes and infixes in other languages are all realized as suffixes in Wolaytta, some words are very long. Stripping these lengthy suffixes from a word iteratively is computationally complex and therefore reduces the accuracy of the stemmer. To conflate the inflected and derived variants of a word into its stem (stripping affixes from the word and leaving a distinct root) with better accuracy, the proposed stemmer uses the longest-match approach. A deep analysis of Wolaytta morphology was carried out, which helped the researcher compile the possible combinations of suffixes. The C# programming language was used for data preprocessing and implementation. After preprocessing, 12,789 unique words remained for the experiment. Of these, 1,200 words were randomly selected for testing. The developed stemmer was then tested using the method proposed by Paice (counting actual under-stemming and over-stemming errors). On the test dataset, the output showed 91.84% accuracy against the actual stems. This result shows that the rule-based longest-match approach is promising for stemming Wolaytta texts.
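The longest-match idea described in the abstract can be illustrated with a minimal sketch. This is not the thesis's implementation: the thesis used C#, while the sketch is in Python for brevity, and the abstract does not reproduce the compiled Wolaytta suffix list. The suffix list below therefore uses English-like placeholders, and the names `SUFFIXES`, `MIN_STEM_LENGTH`, `longest_match_stem`, and `classify_error` are all hypothetical. The error classifier is only a simplified stand-in for counting under-stemming and over-stemming errors, assuming a gold stem is available for each test word.

```python
# Placeholder suffix list (assumption): NOT real Wolaytta suffixes.
# A real stemmer would load the compiled suffix combinations instead.
SUFFIXES = ["s", "ing", "ings", "ness", "fulness"]

# Assumed guard so that very short words are not stripped to nothing.
MIN_STEM_LENGTH = 2


def longest_match_stem(word, suffixes=SUFFIXES, min_stem_length=MIN_STEM_LENGTH):
    """Strip the single longest matching suffix in one pass."""
    # Try candidates from longest to shortest, so a compound suffix
    # such as "fulness" wins over its shorter tail "ness".
    for suffix in sorted(suffixes, key=len, reverse=True):
        remaining = len(word) - len(suffix)
        if word.endswith(suffix) and remaining >= min_stem_length:
            return word[:remaining]
    # No suffix applies: the word is returned unchanged.
    return word


def classify_error(predicted, gold):
    """Simplified error label for one word, given its gold stem."""
    if predicted == gold:
        return "correct"
    if len(predicted) > len(gold):
        return "under"  # too little was removed
    if len(predicted) < len(gold):
        return "over"   # too much was removed
    return "mismatch"   # same length but different string


if __name__ == "__main__":
    for w in ["hopefulness", "meetings", "is"]:
        print(w, "->", longest_match_stem(w))
```

Matching the longest suffix first removes a stacked suffix sequence in a single step, which is the motivation the abstract gives for preferring this approach over iterative stripping.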
