Abstract:
The rapid growth of large volumes of data on the web has made it increasingly challenging to
extract relevant information efficiently. To address this issue, numerous information extraction
tasks have been explored in the literature. One such task is information extraction for the
Gamotho language, which aims to identify key information from large text collections, organize
it chronologically, and answer questions about what happened in a specific situation and when it
occurred. Unlike other information extraction tasks, such as entity extraction, there is a notable
research gap in text information extraction (IE) for the Gamotho language. To date, no work has
been conducted in this specific area. In the first comprehensive effort in this field, the researcher
designed a model for extracting information from Gamotho texts. The model consists of several
components, including general preprocessing, learning and classification, and Gamotho language
information extraction. To develop the proposed model, different approaches were employed for
each task. For the Gamotho language information extraction component, the researcher used a
machine learning classifier that leverages linguistic features such as part-of-speech (POS)
tags, morphological analysis, and gazetteer lists. In practice, relying solely on a single
information extraction method is challenging due to the limited availability of annotated or
labeled data and the lack of linguistic resources. To overcome these limitations and draw on the
strengths of machine learning approaches, the researcher developed a machine learning approach
for Gamotho text information extraction. The researcher ran experiments with several
information extraction algorithms, namely Bi-LSTM with CRF, Support Vector Machines (SVM),
and fine-tuned BERT, covering named entity recognition (NER), relationship extraction, and
event extraction
for Gamotho text. All experiments used a percentage split, the most commonly used training
option, set at 70%: of the 600 sentences in the dataset, 70% (420) were used for training and the
remaining 30% (180) for testing. On the training and testing sets respectively, the Bi-LSTM+CRF
model scored 80.5% and 80.9% for NER, 85.7% and 84.3% for relationship extraction, and 80.2%
and 79.5% for event extraction.
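
The 70/30 percentage split described above can be illustrated with a minimal sketch. It reproduces only the split arithmetic (600 sentences into 420 for training and 180 for testing). The function name, the placeholder sentences, the shuffling step, and the fixed seed are assumptions for illustration; the source does not state whether the data were shuffled or which tool performed the split.

```python
import random


def percentage_split(sentences, train_percent=70, seed=0):
    """Split a list of sentences into training and testing parts.

    Shuffling with a fixed seed is an assumption made for this sketch;
    the original experiments may have split the data differently.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]


# Placeholder data standing in for the 600 annotated Gamotho sentences.
sentences = [f"sentence_{i}" for i in range(600)]
train, test = percentage_split(sentences)
print(len(train), len(test))  # 420 180
```

Fixing the seed keeps the split reproducible across runs, so that scores for different models are compared on the same 180 test sentences.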