Word embedding-based Sinhala news documents classification

dc.contributor.author	Weerasiri, R.I.
dc.contributor.author	Lorensuhewa, S.A.S.
dc.contributor.author	Kalyani, M.A.L.
dc.date.accessioned	2022-03-24T04:09:03Z
dc.date.available	2022-03-24T04:09:03Z
dc.date.issued	2022-01-19
dc.identifier.issn	1391-8796
dc.identifier.uri	http://ir.lib.ruh.ac.lk/xmlui/handle/iruor/5590
dc.description.abstract	The volume of news articles grows daily, and a huge number of text documents are added to the Internet, so manual classification of these documents has become impractical. In Sinhala news document classification, Term Frequency-Inverse Document Frequency (TF-IDF) has been the most commonly used word representation, whereas word embeddings have rarely been used. We compared the performance of Word2Vec, FastText and Doc2Vec with TF-IDF as word representations for Sinhala news document classification, and then applied machine learning approaches to the best-performing word embedding model. We also repeated the experiments for each representation with stop words removed and investigated the feasibility of using Convolutional Neural Networks (CNN).	en_US
dc.language.iso	en	en_US
dc.publisher	Faculty of Science, University of Ruhuna, Matara, Sri Lanka	en_US
dc.subject	Classification	en_US
dc.subject	Word embedding	en_US
dc.subject	FastText	en_US
dc.subject	Sinhala documents	en_US
dc.title	Word embedding-based Sinhala news documents classification	en_US
dc.type	Article	en_US