Development of Longest-match based Stemmer for Wolaytta Language

dc.contributor.author Girma Yohannis Bade
dc.date.accessioned 2016-05-23T12:34:13Z
dc.date.available 2016-05-23T12:34:13Z
dc.date.issued 2014-10
dc.identifier.uri http://hdl.handle.net/123456789/293
dc.description.abstract Stemming algorithms are well known in the domain of Natural Language Processing (NLP) and have a positive impact on Information Retrieval (IR) systems and morphological analysis. This research presents the design, development, and experimental evaluation of a longest-match based stemmer for Wolaytta texts. Wolaytta is a morphologically rich language. In Wolaytta, affixation and compounding are the two ways of forming words, and words formed in either way can be both derived and inflected. Because morphemes that other languages express as prefixes and infixes are all expressed as suffixes in Wolaytta, some words are very long. Stripping these lengthy suffixes from a word iteratively is computationally complex and therefore affects the accuracy of the stemmer. To conflate the inflected and derived variants of a word into its stem (stripping affixes from the word and leaving a distinct root) with better accuracy, the proposed stemmer uses a longest-match approach. A deep analysis of Wolaytta morphology was carried out, which helped the researcher compile the possible combinations of suffixes. The C# programming language was used for data preprocessing and implementation. After preprocessing, 12789 unique words remained for the experiment. Of these, 1200 words were randomly selected for testing. The developed stemmer was then tested using the method proposed by Paice (counting actual under-stemming and over-stemming errors). The output on the test dataset showed 91.84% accuracy with respect to the actual stemmed words. The result shows that the rule-based longest-match approach is promising for stemming Wolaytta language texts. en_US
dc.language.iso en en_US
dc.publisher ARBA MINCH UNIVERSITY en_US
dc.subject Natural Language Processing, Morphology, Longest-match. en_US
dc.title Development of Longest-match based Stemmer for Wolaytta Language en_US
dc.type Thesis en_US
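
As an illustration of the longest-match idea described in the abstract, a minimal Python sketch follows. It is not the thesis's implementation, which was written in C#, and the thesis's compiled Wolaytta suffix combinations are not reproduced in this record. The suffix list, the minimum stem length, and the function name below are hypothetical placeholders chosen only to show the technique: try the longest candidate suffix first, so that a long suffix chain is removed in one step instead of iteratively.

```python
# Toy suffix list: hypothetical English-like entries for illustration only.
# It is NOT the Wolaytta suffix table compiled in the thesis.
SUFFIXES = ["ization", "ations", "ation", "ness", "ing", "ed", "s"]

# Assumed lower bound on stem length, to avoid stripping very short words.
MIN_STEM_LEN = 3


def longest_match_stem(word, suffixes=SUFFIXES, min_stem_len=MIN_STEM_LEN):
    """Strip the single longest listed suffix that matches the end of `word`."""
    word = word.lower()
    # Longest candidates are tried first, so one pass removes the whole match.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No listed suffix applies: the word is returned unchanged.
    return word
```

With this toy list, `longest_match_stem("organizations")` strips `ations` in one step rather than removing `s` and then `ation`; with a real suffix table, the quality of the result depends on how completely the suffix combinations are enumerated.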