Abstract:
Stemming algorithms are widely used in Natural Language
Processing (NLP) and have a positive impact on Information Retrieval (IR) systems
and morphological analysis. This research presents the design, development, and experimental
evaluation of a longest-match-based stemmer for Wolaytta texts. Wolaytta is a morphologically
rich language. In Wolaytta, affixation and compounding are the two ways of forming
words. Words formed in either way can be both derived and inflected. Since
morphemes that appear as prefixes and infixes in other languages are all
expressed as suffixes in Wolaytta, some words are very long.
Stripping these lengthy suffixes from a word iteratively is computationally
expensive and therefore affects the accuracy of the stemmer. To conflate the
inflected and derived variants of a word into its stem (stripping affixes from the words and
leaving the distinct root) with better accuracy, the proposed stemmer uses a
longest-match approach. A detailed analysis of Wolaytta morphology was carried out,
which helped the researcher compile the possible combinations of suffixes.
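
As an illustration of the longest-match idea, the following is a minimal sketch rather than the stemmer developed in this research. It is written in Python for brevity, whereas the actual implementation used C#. The suffix strings are hypothetical placeholders and not verified Wolaytta suffixes, and the minimum stem length is an assumed safeguard. The point it shows is that the longest applicable suffix is removed in a single step instead of stripping short suffixes one at a time.

```python
# Hypothetical placeholder suffixes for illustration only;
# these are NOT verified Wolaytta suffixes or suffix combinations.
SUFFIXES = ["ettidosona", "idosona", "osona", "ona", "a"]

# Assumed guard against stripping a word down to almost nothing.
MIN_STEM_LEN = 2

# Sort once, longest first, so the first match found is the longest match.
SORTED_SUFFIXES = sorted(SUFFIXES, key=len, reverse=True)


def longest_match_stem(word):
    """Strip the longest matching suffix from word in a single pass."""
    word = word.lower()
    for suffix in SORTED_SUFFIXES:
        stem_len = len(word) - len(suffix)
        if word.endswith(suffix) and stem_len >= MIN_STEM_LEN:
            return word[:stem_len]
    # No suffix applies (or the stem would be too short): leave unchanged.
    return word
```

With these placeholder entries, `longest_match_stem("talidosona")` removes `idosona` in one step rather than peeling off `a`, then `ona`, and so on. In a real stemmer the list would hold the compiled suffix combinations from the morphological analysis.
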
The C# programming language was used for data preprocessing and
implementation. After preprocessing, 12,789 unique words remained for the experiments. Of
these unique words, 1,200 were randomly selected for testing. The
developed stemmer was then evaluated using the method proposed by Paice (counting actual
understemming and overstemming errors). On the test dataset, the stemmer achieved 91.84% accuracy
measured against the actual stemmed words. The result shows that the rule-based longest-match
approach is promising for stemming Wolaytta texts.
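
The error counting described above can be sketched as follows. This is a simplified illustration in Python, not the evaluation code used in this research, and it does not compute Paice's full understemming and overstemming indices. It assumes a hypothetical list of (word, expected stem) pairs and classifies an output as under-stemmed when it is longer than the expected stem and over-stemmed when it is shorter, which is a simplifying assumption.

```python
def count_stemming_errors(pairs, stem_fn):
    """Count correct, under-stemmed, and over-stemmed outputs.

    pairs   : iterable of (word, expected_stem) tuples (hypothetical gold data)
    stem_fn : function mapping a word to its stem
    """
    correct = under = over = other = 0
    for word, expected in pairs:
        produced = stem_fn(word)
        if produced == expected:
            correct += 1
        elif len(produced) > len(expected):
            under += 1   # too little was stripped
        elif len(produced) < len(expected):
            over += 1    # too much was stripped
        else:
            other += 1   # same length but a different string
    total = correct + under + over + other
    accuracy = 100.0 * correct / total if total else 0.0
    return {"correct": correct, "under": under, "over": over,
            "other": other, "accuracy": accuracy}


def toy_stemmer(word):
    """Toy English stemmer used only to exercise the counter."""
    return word[:-1] if word.endswith("s") else word
```

For example, calling `count_stemming_errors` with `toy_stemmer` on the pairs `("cats", "cat")`, `("boxes", "box")`, `("bus", "bus")`, and `("dog", "dog")` gives two correct stems, one under-stemming error, and one over-stemming error.
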