Sinhala news analysis using text mining and machine learning


dc.contributor.author Ekanayaka, R.K.S.K.
dc.contributor.author Lorensuhewa, S.A.S.
dc.contributor.author Kalyani, M.A.L.
dc.date.accessioned 2023-02-03T03:13:55Z
dc.date.available 2023-02-03T03:13:55Z
dc.date.issued 2018-02-15
dc.identifier.issn 1391-8796
dc.identifier.uri http://ir.lib.ruh.ac.lk/xmlui/handle/iruor/10725
dc.description.abstract Due to the rapid development of information technology, vast amounts of information are generated daily. Unstructured data such as news reports are a significant part of these growing information repositories. This study analyzes Sinhala news reports published online to extract important features using text mining and machine learning techniques, and then represents the extracted information so that readers can more easily read news or research past news reports. Text mining is a difficult task for a morphologically rich, complex language like Sinhala. In our approach, we first pre-processed the dataset with filtering, stop word removal and stemming, and then experimented with feature selection methods such as n-gram combinations, count vectorizer and TF-IDF vectorizer. Text classification methods (Naïve Bayes, Support Vector Machines and Decision Trees) and clustering methods (K-means and hierarchical clustering) were evaluated. We then represented the mined knowledge using information visualization methods such as charts, tag clouds and tree structures. Unigram features with the TF-IDF vectorizer for feature selection, Naïve Bayes for document classification and K-means for clustering were the most accurate techniques for Sinhala news. The accuracy of the information visualization methods was assessed by human experts. Our results reveal that language-specific text pre-processing and feature selection increase the efficiency of information retrieval tasks compared with generally used existing methods, and that the new representation model saves users' time and effort in finding news reports that match their preferences, compared with browsing existing news websites. en_US
dc.language.iso en en_US
dc.publisher Faculty of Science, University of Ruhuna, Matara, Sri Lanka en_US
dc.subject Sinhala language en_US
dc.subject Feature selection en_US
dc.subject Text classification en_US
dc.subject Text clustering en_US
dc.subject Information visualization en_US
dc.title Sinhala news analysis using text mining and machine learning en_US
dc.type Article en_US
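
The classification setup the abstract reports as most accurate (unigram features, TF-IDF weighting, Naïve Bayes) can be illustrated with a minimal sketch. This is not the authors' code: the toy corpus, the labels, the smoothing constants and every function name below are assumptions made for illustration, and placeholder English tokens stand in for Sinhala text that has already been filtered, stripped of stop words and stemmed.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: (pre-processed text, label). Placeholder English
# tokens stand in for stemmed, stop-word-filtered Sinhala tokens.
TRAIN = [
    ("cricket match team win", "sports"),
    ("team score match player", "sports"),
    ("election vote parliament minister", "politics"),
    ("minister parliament budget vote", "politics"),
]


def unigrams(text):
    """Split pre-processed text into unigram tokens."""
    return text.split()


def fit_idf(docs):
    """Smoothed inverse document frequency for every unigram in the corpus."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(unigrams(doc)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def tfidf(text, idf):
    """TF-IDF weights for one document; terms unseen in training are dropped."""
    tf = Counter(t for t in unigrams(text) if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def train_nb(samples, idf, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    doc_count = Counter(label for _, label in samples)
    weights = defaultdict(lambda: defaultdict(float))
    for text, label in samples:
        for term, w in tfidf(text, idf).items():
            weights[label][term] += w
    model = {}
    for label, n_docs in doc_count.items():
        total = sum(weights[label].values()) + alpha * len(idf)
        log_like = {
            t: math.log((weights[label].get(t, 0.0) + alpha) / total)
            for t in idf
        }
        model[label] = (math.log(n_docs / len(samples)), log_like)
    return model


def predict(text, model, idf):
    """Return the label with the highest log-posterior score."""
    features = tfidf(text, idf)

    def score(label):
        log_prior, log_like = model[label]
        return log_prior + sum(w * log_like[t] for t, w in features.items())

    return max(model, key=score)


IDF = fit_idf([text for text, _ in TRAIN])
MODEL = train_nb(TRAIN, IDF)
print(predict("match player team", MODEL, IDF))
```

On real data, the language-specific pre-processing described in the abstract would run before `unigrams`, and the accuracy comparison against other classifiers would need a held-out test set, which this sketch omits.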

