Abstract:
Erroneous words fall into two categories: non-word errors and real-word errors. Such errors arise while typing a document because of fast typing, switching of fingers on keys, the input tools and methods used, or not knowing the correct pronunciation, spelling, or meaning of a word. This paper proposes a common approach to correcting both non-word and real-word errors in the Tamil language. Erroneous words are detected by checking how appropriate each word is in the context of its sentence. A bigram probabilistic model is constructed for this purpose because it is simple and was found to be good enough, compared with a trigram model, to determine the appropriateness of a valid word in its sentence context. A word that lacks appropriateness is marked as erroneous (a non-word or real-word error), and a word-level trigram technique is used to generate suggestions. When more than three suggestions are found, a word-level n-gram (unigram, bigram, and trigram) probabilistic language model is constructed to determine which suggestions are appropriate to the context. Test results show that the proposed erroneous word detection and correction system performs well. In our testing with 9170 sentences containing 142 non-word errors and 119 real-word errors, the bigram probabilistic model detected all of them, non-word as well as real-word. For the 261 erroneous words, the error correction module produced 583 suggestions, of which 569 were found to be appropriate to the context. The suggestions were checked by a scholar of the Tamil language and found to be 97.6% correct, with an F1-score of 0.99. This shows that an approach proven effective for detecting and correcting real-word errors can be used for non-word errors as well.
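The detection step described above can be illustrated with a minimal sketch. It is not the system evaluated in this paper: the unsmoothed probability estimate, the threshold value, the sentence-start marker, the function names, and the small English stand-in corpus are all assumptions made for illustration. A word is flagged when its estimated bigram probability given the preceding word falls below the threshold, which treats a misspelled non-word and a valid but misplaced word in the same way.

```python
from collections import Counter

START = "<s>"  # sentence-start marker (an assumption of this sketch)


def train_bigram(sentences):
    """Count bigrams and their left contexts over tokenized sentences."""
    context_counts, bigram_counts = Counter(), Counter()
    for tokens in sentences:
        padded = [START] + list(tokens)
        context_counts.update(padded[:-1])             # words used as left context
        bigram_counts.update(zip(padded, padded[1:]))  # (previous, current) pairs
    return context_counts, bigram_counts


def bigram_prob(prev, word, model):
    """Unsmoothed estimate of P(word | prev); 0.0 when prev was never seen."""
    context_counts, bigram_counts = model
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


def detect_errors(tokens, model, threshold=0.05):
    """Return indices of words that are improbable after the preceding word.

    The threshold is a hypothetical value chosen for the toy corpus below.
    """
    padded = [START] + list(tokens)
    return [
        i
        for i, (prev, word) in enumerate(zip(padded, padded[1:]))
        if bigram_prob(prev, word, model) < threshold
    ]


# Toy English corpus standing in for a Tamil training corpus (illustrative only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
    "the cat ate the fish".split(),
]
model = train_bigram(corpus)

print(detect_errors("the dog sat on the rgu".split(), model))  # non-word error
print(detect_errors("the cat sat on the ate".split(), model))  # real-word error
print(detect_errors("the cat sat on the mat".split(), model))  # no error
```

In this sketch both the non-word "rgu" and the valid but misplaced word "ate" are flagged at position 5, while the correct sentence yields no flags. A word that directly follows an error would also be flagged, because its left context is unseen; a practical system would need smoothing and a rule for handling that case.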