Abstract:
To develop a suitable approach for Sentiment Analysis of Tamil texts using
K-means clustering with a k-Nearest Neighbour (k-NN) classifier, a corpus
(UJ_Corpus_Opinions) consisting of 1518 positive and 1173 negative comments
was constructed. For training, 820 positive and 820 negative comments were
used; for testing, 650 positive and 350 negative comments were used.
Bag of Words (BoW) and fastText vectors were used to create feature
vectors. These feature vectors were clustered using K-means clustering, and
the cluster centroids were used as classification keys for the k-NN
classifier. Two clustering techniques were used to develop two models: (i)
clustering with class-wise information and (ii) clustering with no class-wise
information. These two models were also tested using K-Fold cross-validation,
giving four models in total. All four models were tested with both types of
feature vectors.
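
The class-wise variant described above can be illustrated with a minimal
sketch. It is not the authors' implementation: the function names
(kmeans, classwise_keys, knn_predict), the Euclidean distance, the seeded
random initialisation and the toy 2-D vectors are all assumptions made for
illustration. Each class is clustered separately, every centroid inherits
the label of its class, and a new comment is labelled by a majority vote
among its Kn nearest centroids.

```python
import math
import random
from collections import Counter

def kmeans(points, kc, iters=20, seed=0):
    """Plain Lloyd's K-means; returns kc centroids as lists of floats."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, kc)]
    for _ in range(iters):
        groups = [[] for _ in range(kc)]
        for p in points:
            nearest = min(range(kc), key=lambda i: math.dist(p, centroids[i]))
            groups[nearest].append(p)
        for i, group in enumerate(groups):
            if group:  # keep the old centroid if a cluster goes empty
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def classwise_keys(X, y, kc):
    """Cluster each class separately; each centroid inherits the class label."""
    keys = []
    for label in sorted(set(y)):
        members = [x for x, lab in zip(X, y) if lab == label]
        keys += [(c, label) for c in kmeans(members, kc)]
    return keys

def knn_predict(keys, x, kn):
    """Majority vote among the kn centroids closest to x."""
    nearest = sorted(keys, key=lambda k: math.dist(x, k[0]))[:kn]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Toy 2-D vectors standing in for BoW or fastText features (made-up data).
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
y = ["neg", "neg", "neg", "pos", "pos", "pos"]
keys = classwise_keys(X, y, kc=2)          # Kc centroids per class
prediction = knn_predict(keys, [4.9, 5.2], kn=1)
```

For the variant with no class-wise information, one plausible choice (again
an assumption) is to cluster all training vectors together and label each
centroid with the majority class of its members.
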
These models were tested with varying numbers of centroids (Kc:1..10),
neighbours (Kn:1..Kc) and folds (Kf:1..10) to study their influence on
accuracy. Accuracy increased with Kc, and the highest accuracy (74%) was
obtained with Kn=1 and Kf=2. Accuracy was, in general, higher with fastText
than with BoW. The model that obtained 74% accuracy (fastText with
class-wise clustering and K-Fold) had an F1-Score of 0.74.
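
For reference, the two reported metrics can be computed from predicted and
true labels as in the sketch below. The abstract does not state which F1
variant was used; the binary F1 of the positive class is assumed here, and
the function name accuracy_and_f1 and the label lists are made up for
illustration.

```python
def accuracy_and_f1(y_true, y_pred, positive="pos"):
    """Return (accuracy, F1 of the positive class) for two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return accuracy, f1

# Made-up labels, not results from the corpus.
y_true = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred = ["pos", "pos", "neg", "neg", "neg", "pos"]
acc, f1 = accuracy_and_f1(y_true, y_pred)
```
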