Dictionary based approach

My first attempt to develop a spellchecker for Malayalam was in 2007. I was using hunspell and a word list based approach. It was not successful because of rich morphology of Malayalam. Even though I prepared a manually curated 150K words list, it was nowhere near to cover practically infinite words of Malayalam. For languages with productive morphological processes in compounding and derivation that are capable of generating dictionaries of infinite length, a morphology analysis and generation system is required.

The problems with hunspell based spellchecker and Malayalam

Hunspell has a limited compounding support, but limited to two levels. Malayalam can have more than 2 level compounding and sometimes the agglutinated words is also inflected. Hunspell system has an affix dictionary and suffix mapping system. But it is very limited to support complex morphology like Malayalam. With the help of Németh László, Hunspell developer, I had explored this path. But abandoned due to many limitation of Hunspell and lack of programmatic control of the morphological rules.

A Data-Driven Approach to Checking and Correcting Spelling Errors in Sinhala. Asanka Wasala, Ruvan Weerasinghe, Randil Pushpananda, Chamila Liyanage and Eranga Jayalatharachchi [pdf] This paper discuss the phonetic similarity based strategies to create a wordlist, instead of edit distance approach.

Last updated