Abstract:
The number of news articles published daily is growing rapidly, adding a huge volume of text documents to the Internet, and manual classification of these documents has become infeasible. In Sinhala news document classification, TF-IDF has typically been used as the word representation, while word embeddings have rarely been applied. We compared the performance of Word2Vec, FastText, and Doc2Vec with the commonly used Term Frequency-Inverse Document Frequency (TF-IDF) representation for Sinhala news document classification, and applied machine learning approaches to the best word embedding model identified. We also experimented with each representation after removing stop words and investigated the feasibility of using Convolutional Neural Networks (CNN).