A preliminary model of predictive text for Sinhala using N-gram Statistics

Chanaka, K.M.R.; Lorensuhewa, S.A.S.; Kalyani, M.A.L.

Ruhuna International Science and Technology Conference, RISTCON 2017

dc.contributor.author	Chanaka, K.M.R.
dc.contributor.author	Lorensuhewa, S.A.S.
dc.contributor.author	Kalyani, M.A.L.
dc.date.accessioned	2023-01-30T08:48:00Z
dc.date.available	2023-01-30T08:48:00Z
dc.date.issued	2017-01-26
dc.identifier.issn	1391-8796
dc.identifier.uri	http://ir.lib.ruh.ac.lk/xmlui/handle/iruor/10495
dc.description.abstract	Most Sri Lankans use Sinhala text processing in their day-to-day activities. However, typing documents in Sinhala is difficult: it takes more time and is prone to typing mistakes, so efficiency is low. Integrating a word prediction facility lets the user select words rather than type them repeatedly, which reduces the number of required keystrokes, minimizes mistakes and saves time. The aim of this research is to explore the use of Natural Language Processing and Machine Learning techniques to assist Sinhala typing tasks by predicting words. We predict the next word to type from an n-gram probabilistic model, using a bi-gram model, a tri-gram model and a combination of the two. This composite n-gram model includes both bi-grams and tri-grams, giving high priority to the tri-gram suggestions. The n-gram corpus is generated from a Sinhala corpus collected from online Sinhala newspapers. A maximum prediction percentage of 41 was achieved for sports documents by using a domain-specific n-gram corpus of sports documents, and an 18.1% average keystroke reduction was obtained by using the prediction model. We also tested with other news categories, such as political, legal and local, collected from local newspapers. According to our experimental results, the composite n-gram model outperformed the bi-gram and tri-gram word prediction models, and the domain-specific composite n-gram model performs better than the composite model created from a mixed corpus. Our goal is to automatically cluster the document corpus, classify the text being edited once a certain amount has been entered, and obtain predictions from the relevant cluster dynamically, improving accuracy at runtime and giving more relevant predictions.	en_US
dc.language.iso	en	en_US
dc.publisher	Faculty of Science, University of Ruhuna, Matara, Sri Lanka	en_US
dc.subject	Word Prediction	en_US
dc.subject	Dynamic Text Prediction	en_US
dc.subject	N-Gram Model	en_US
dc.subject	NLP	en_US
dc.subject	Text Mining	en_US
dc.title	A preliminary model of predictive text for Sinhala using N-gram Statistics	en_US
dc.type	Article	en_US
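
The composite n-gram model described in the abstract, which gives tri-gram suggestions priority over bi-gram suggestions, could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `train` and `predict`, the parameter `k`, the count-based ranking and the English placeholder corpus are all hypothetical, and the record does not say how the paper ranks or merges suggestions.

```python
from collections import Counter, defaultdict


def train(sentences):
    """Count bi-gram and tri-gram continuations from tokenised sentences."""
    bigrams = defaultdict(Counter)   # previous word -> counts of next word
    trigrams = defaultdict(Counter)  # (two previous words) -> counts of next word
    for tokens in sentences:
        for i in range(1, len(tokens)):
            bigrams[tokens[i - 1]][tokens[i]] += 1
            if i >= 2:
                trigrams[(tokens[i - 2], tokens[i - 1])][tokens[i]] += 1
    return bigrams, trigrams


def predict(bigrams, trigrams, context, k=3):
    """Return up to k suggestions: tri-gram matches first, then bi-gram matches.

    Assumption: suggestions are ranked by raw frequency within each model.
    """
    suggestions = []
    if len(context) >= 2:
        tri_counts = trigrams.get((context[-2], context[-1]), Counter())
        suggestions.extend(word for word, _ in tri_counts.most_common(k))
    if context:
        bi_counts = bigrams.get(context[-1], Counter())
        for word, _ in bi_counts.most_common():
            if len(suggestions) >= k:
                break
            if word not in suggestions:  # skip words already suggested by tri-grams
                suggestions.append(word)
    return suggestions[:k]


# Hypothetical placeholder corpus; a real one would hold tokenised Sinhala text.
corpus = [
    ["the", "team", "won", "the", "match"],
    ["the", "team", "won", "the", "toss"],
    ["the", "team", "lost", "the", "match"],
    ["we", "won", "a", "medal"],
]
bigrams, trigrams = train(corpus)
print(predict(bigrams, trigrams, ["we", "won"]))  # tri-gram suggestion listed first
print(predict(bigrams, trigrams, ["team"]))       # falls back to bi-grams only
```

In this sketch the word "a" is suggested before "the" after "we won", even though "the" follows "won" more often overall, because the tri-gram context takes priority. A domain-specific model, as in the abstract, would simply be trained on a corpus restricted to one news category.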