Abstract:
For a computer to manipulate, analyze, and process human language, the language must be organized and
structured in a form the computer can interpret. Part-of-speech (POS) tagging is a Natural
Language Processing (NLP) task that labels each word with its appropriate
part-of-speech tag. To the best of the researcher's knowledge, no POS tagging research has been
conducted for the Gamotstso language, although part-of-speech taggers have been developed
using different approaches for other local languages such as Amharic, Afaan Oromoo,
Wolayta, Kafi-noonoo, and Tigrigna. In this study, a part-of-speech tagger for the Gamotstso language has
been developed using Hidden Markov Model (HMM) and rule-based approaches. The Viterbi
algorithm was used for the Hidden Markov Model and Brill's transformation-based error-driven learning for
the rule-based approach, each with slight modifications to its modules to suit the
nature of the language. Natural Language Toolkit (NLTK) version 3.4.5 and Python 2.7 were used to
implement the tagger model and conduct the experimental analysis. Linguists were consulted
and the literature was reviewed to understand the morphological and
grammatical structure of the language and to identify a suitable tagset for the study. As a
result, a tagset of 25 tags was identified. In total, 1,346 sentences comprising 25,512 words with
6,919 unique words were collected from news from the FM 90.9 radio station and the Gamotstso New
Testament Bible. The collected corpus was split into training and test sets:
80% of the corpus was used to train the tagger models and the remaining 20% to test
their performance. Both the Hidden Markov Model and rule-based taggers
were trained and tested on the same data. The Hidden Markov Model
unigram, bigram, and trigram taggers achieved accuracies of 89.6%, 90.6%, and 91.5%,
respectively, and the rule-based taggers, which use the unigram, bigram, and trigram taggers as
initial-stage taggers, achieved accuracies of 91%, 91.5%, and 92.2%, respectively. The
performance analysis shows that the rule-based taggers outperform the Hidden Markov
Model taggers. To improve the performance of the taggers, a pre-prepared, standard, balanced
corpus and a standard tagset are recommended for future work.
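
As an illustration of the Viterbi decoding step mentioned above, the following is a minimal sketch of a bigram HMM tagger in plain Python. It is not the study's implementation: the toy corpus uses placeholder English tokens rather than Gamotstso data, the tag names (DET, N, V) are hypothetical, add-one smoothing is an assumption made here to handle unseen words, and none of the language-specific modifications described in the study are reproduced.

```python
import math
from collections import Counter, defaultdict

# Toy tagged sentences with placeholder tokens (assumption: NOT Gamotstso data).
TRAIN = [
    [("the", "DET"), ("dog", "N"), ("barks", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("a", "DET"), ("dog", "N"), ("sleeps", "V")],
    [("dogs", "N"), ("bark", "V")],
]


def train_hmm(sentences):
    """Count tag-to-tag transitions and tag-to-word emissions."""
    trans = defaultdict(Counter)  # trans[previous_tag][tag]
    emit = defaultdict(Counter)   # emit[tag][word]
    for sent in sentences:
        prev = "<s>"              # sentence-start pseudo-tag
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            prev = tag
    vocab = {word for sent in sentences for word, _ in sent}
    return trans, emit, vocab


def log_prob(counter, key, n_outcomes):
    """Add-one smoothed log probability (smoothing choice is an assumption)."""
    return math.log((counter[key] + 1) / (sum(counter.values()) + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for `words` as (word, tag) pairs."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags)
    n_words = len(vocab) + 1      # one extra slot for unknown words
    # best[i][tag] = (log score of best path ending in tag at i, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans["<s>"], t, n_tags) + log_prob(emit[t], words[0], n_words)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit[t], words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans[p], t, n_tags) + e, p)
                for p in tags
            )
    # Follow back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    path.reverse()
    return list(zip(words, path))


trans, emit, vocab = train_hmm(TRAIN)
tagged = viterbi(["the", "cat", "barks"], trans, emit, vocab)
```

On this toy data, `tagged` pairs "the", "cat", and "barks" with DET, N, and V. In a transformation-based setup such as the one described above, output like this would serve as the initial tagging that learned rules then correct.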