An improved kNN algorithm using K-means and fastText to predict sentiments expressed in Tamil texts

Thavareesan, S.; Mahesan, S.

Collection: Ruhuna International Science and Technology Conference, RISTCON 2020

URI: http://ir.lib.ruh.ac.lk/xmlui/handle/iruor/11504

Date: 2020-01-22

Abstract:

To develop a suitable approach to Sentiment Analysis of Tamil texts using K-means clustering with a k-Nearest Neighbour (k-NN) classifier, a corpus (UJ_Corpus_Opinions) consisting of 1518 Positive and 1173 Negative comments was constructed. For training, 820 positive and 820 negative comments were used; for testing, 650 positive and 350 negative comments were used. Bag of Words (BoW) and fastText vectors were used to create feature vectors. These feature vectors were clustered using K-means clustering, and the cluster centroids were used as classification keys for the k-NN classifier. Two clustering techniques were used to develop two models: (i) using class-wise information, and (ii) with no class-wise information. These two models were also tested using K-Fold, giving four models in total, and all four were tested with both types of feature vectors. The models were tested with varying numbers of centroids (Kc: 1..10), neighbours (Kn: 1..Kc) and folds (Kf: 1..10) to study their influence on accuracy. Accuracy increased with the value of Kc, and the highest accuracy (74%) was obtained for Kn=1 and Kf=2. Accuracy was, in general, higher with fastText than with BoW. The model using fastText and class-wise clustering with K-Fold, which obtained 74% accuracy, had an F1-Score of 0.74.
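
The abstract does not give implementation details, so the following is only a minimal sketch of the class-wise variant: each class is clustered separately with K-means, and the labelled centroids then serve as the classification keys for k-NN. The function names (kmeans, fit_classwise, predict), the use of plain Lloyd's K-means with Euclidean distance, and the toy 2-D vectors standing in for BoW or fastText feature vectors are all assumptions made for illustration, not the authors' code.

```python
import math
import random
from collections import Counter


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means (an assumption); returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (unchanged if the group is empty).
        for i, group in enumerate(groups):
            if group:
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def fit_classwise(vectors, labels, kc):
    """Cluster each class separately; every centroid inherits its class label."""
    keys = []
    for label in sorted(set(labels)):
        members = [v for v, y in zip(vectors, labels) if y == label]
        keys += [(centroid, label) for centroid in kmeans(members, kc)]
    return keys


def predict(keys, vector, kn=1):
    """k-NN over the labelled centroids rather than the full training set."""
    nearest = sorted(keys, key=lambda key: math.dist(vector, key[0]))[:kn]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical 2-D vectors standing in for real feature vectors.
train = [[0.9, 0.1], [0.8, 0.2], [1.0, 0.0], [0.1, 0.9], [0.2, 0.8], [0.0, 1.0]]
labels = ["positive"] * 3 + ["negative"] * 3

keys = fit_classwise(train, labels, kc=2)   # Kc = 2 centroids per class
print(predict(keys, [0.85, 0.1], kn=1))     # prints "positive"
```

Here kc plays the role of Kc (centroids per class) and kn the role of Kn (neighbours among the centroids). Comparing a test vector against a handful of centroids instead of every training comment is what keeps the k-NN step cheap; the variant with no class-wise information would presumably cluster all vectors together and label each centroid afterwards, which this sketch does not cover.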
