Sinhala news analysis using text mining and machine learning

Ekanayaka, R.K.S.K.; Lorensuhewa, S.A.S.; Kalyani, M.A.L.

dc.contributor.author	Ekanayaka, R.K.S.K.
dc.contributor.author	Lorensuhewa, S.A.S.
dc.contributor.author	Kalyani, M.A.L.
dc.date.accessioned	2023-02-03T03:13:55Z
dc.date.available	2023-02-03T03:13:55Z
dc.date.issued	2018-02-15
dc.identifier.issn	1391-8796
dc.identifier.uri	http://ir.lib.ruh.ac.lk/xmlui/handle/iruor/10725
dc.description.abstract	Due to the rapid development of information technology, vast amounts of information are generated daily. Unstructured data such as news reports are a significant part of these growing information repositories. This study focuses on analyzing Sinhala news reports published online to extract important features using text mining and machine learning techniques, and then on representing the extracted information so that readers can easily read news or research past news reports. Sinhala is a morphologically rich, complex language, which makes text mining a difficult task. In our approach, we first preprocessed the dataset with filtering, stop word removal and stemming, and then experimented with feature selection methods such as n-gram combinations, count vectorizer and TF-IDF vectorizer. Text classification methods such as Naïve Bayes, Support Vector Machines and Decision Trees, and clustering methods such as K-means and hierarchical clustering, were evaluated. Later, we represented the mined knowledge using information visualization methods such as charts, tag clouds and tree structures. Unigram features with TF-IDF vectorizer for feature selection, Naïve Bayes for document classification and K-means for clustering were the most accurate techniques for Sinhala news. The accuracy of the information visualization methods was measured with human experts. Our results reveal that language-specific text preprocessing and feature selection increase the efficiency of information retrieval tasks compared to commonly used existing methods, and that the new representation model saves users' time and effort in finding news reports based on their preferences rather than going through existing news websites.	en_US
dc.language.iso	en	en_US
dc.publisher	Faculty of Science, University of Ruhuna, Matara, Sri Lanka	en_US
dc.subject	Sinhala language	en_US
dc.subject	Feature selection	en_US
dc.subject	Text classification	en_US
dc.subject	Text clustering	en_US
dc.subject	Information visualization	en_US
dc.title	Sinhala news analysis using text mining and machine learning	en_US
dc.type	Article	en_US
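
The abstract reports unigram TF-IDF features with Naïve Bayes as the most accurate classification setup. The sketch below is a minimal, standard-library illustration of that combination and is not the authors' implementation. The function names, the smoothed IDF formula, the smoothing parameter alpha and the English toy documents and labels are all assumptions made for illustration; the filtering, stop word removal and stemming steps mentioned in the abstract are omitted.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Whitespace unigrams only. A real Sinhala pipeline would also filter,
    # remove stop words and stem, as the abstract describes.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency (assumed formula): ln((1+n)/(1+df)) + 1.
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    # Raw term count times IDF; terms unseen during fitting are dropped.
    counts = Counter(tokenize(doc))
    return {t: c * idf[t] for t, c in counts.items() if t in idf}


def train_nb(docs, labels, idf, alpha=1.0):
    # Multinomial Naive Bayes over TF-IDF weights with additive smoothing.
    class_docs = Counter(labels)
    weights = defaultdict(lambda: defaultdict(float))
    for doc, label in zip(docs, labels):
        for term, w in tfidf(doc, idf).items():
            weights[label][term] += w
    model = {}
    for label, n_docs in class_docs.items():
        denom = sum(weights[label].values()) + alpha * len(idf)
        log_prior = math.log(n_docs / len(docs))
        log_like = {t: math.log((weights[label].get(t, 0.0) + alpha) / denom) for t in idf}
        model[label] = (log_prior, log_like)
    return model


def predict(doc, idf, model):
    # Pick the class with the highest log posterior score.
    vec = tfidf(doc, idf)
    scores = {
        label: log_prior + sum(w * log_like[t] for t, w in vec.items())
        for label, (log_prior, log_like) in model.items()
    }
    return max(scores, key=scores.get)


# Hypothetical toy corpus (English stand-ins, not data from the study).
train_docs = [
    "cricket team wins match",
    "team scores in final match",
    "parliament passes budget bill",
    "minister presents budget in parliament",
]
train_labels = ["sports", "sports", "politics", "politics"]

idf = fit_idf(train_docs)
model = train_nb(train_docs, train_labels, idf)
print(predict("cricket match final", idf, model))
```

In this sketch, terms that occur in fewer documents receive a larger IDF weight, so category-specific words contribute more to the class score than words shared across categories.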