Abstract:
The number of news articles published daily is growing rapidly, adding a huge volume of text documents to the Internet, and manual classification of these documents has become infeasible. In Sinhala news document classification, TF-IDF has typically been used as the word representation, while word embeddings have rarely been applied. We compared the performance of Word2Vec, FastText, and Doc2Vec with the commonly used Term Frequency-Inverse Document Frequency (TF-IDF) representation for Sinhala news document classification, and applied machine learning approaches to the best word embedding model identified. We also experimented with each representation after removing stop words and investigated the feasibility of using Convolutional Neural Networks (CNN).