šŸ“š
Docs
  • Welcome
  • Santhosh Thottingal
    • Coding
    • Software I use
    • Research Papers
    • Talks
    • Projects
    • In news
    • Ideas
    • Books
  • Malayalam Computing
    • Unicode
      • Syllable
      • Conjunct
      • Articles
    • Input methods
      • Inscript
      • Swanalekha
      • Handwriting Recognition
        • Procrustes Analysis
      • Proprietory Input Methods
      • What is a good input method?
      • Typewriter
    • Script Rendering
      • Orthography
      • Ya Ra Va Signs
      • U signs
    • Type Design
      • Color Fonts
      • Curves
      • Design Ideas
      • Manjari
        • Gallery
      • Chilanka
      • Gayathri
      • Customize Malayalam fonts in Linux
      • Articles
      • Tools
      • Type classification
        • Display typefaces
    • Spellcheck
      • History
      • Dictionary based approach
      • Nature of Malayalam spelling mistakes
      • Morphology analyser based approach
      • Tools and services
      • Links
    • Hyphenation
      • Web page
    • Typesetting
      • LaTeX
      • Scribus
      • PDF
      • XeTeX
      • Indesign
      • Markup languages
    • Speech Recognition
    • Speech Synthesis
      • Dhvani
    • Collation
    • Corpus
    • Morphology Analysis
      • Mlmorph
        • Snippets
      • Part of speech tagging
      • Morphology complexity
    • Named Entity Recognition
    • Numbers
      • Number spellout
      • Hindi
    • Machine Translation
      • Neural Machine Translation
    • Optical Character recognition
    • Transliteration
    • Digitization
    • NLP
      • Low resource languages
      • Natural Language Generation
    • Grammar analysis
      • Style checkers
    • Dictionary
      • Lexicon
    • Natural Language Understanding
    • Natural Language Generation
    • Swathanthra Malayalam Computing
    • Meta
      • Malayalam Sign Language
      • ą“Ŗą“¦ą“Øą“æąµ¼ą“®ą“æą“¤ą“æ
      • History
      • ą“²ą“æą“Ŗą“æą“Ŗą“°ą“æą“£ą“¾ą“®ą“‚ ą“Øą“æą“²ą“šąµą“šąµą“Ŗąµ‹ą“Æąµ‹?
      • ą“­ą“¾ą“·ą“¾ ą“Ŗą“ ą“Øą“‚
      • ą“¶ąµą“°ąµ‡ą“·ąµą“  ą“­ą“¾ą“·
      • Dictionary
    • Encyclopedia
    • Government
      • Script
      • ą“•ąµ‡ą“°ą“³ ą“­ą“¾ą“·ą“¾ ą“‡ąµ»ą“øąµą“±ąµą“±ą“æą“±ąµą“±ąµą“Æąµ‚ą“Ÿąµą“Ÿąµ
  • Academic Research
    • Knowledge Dissemination
    • Research papers
    • Reproducible Research
  • Arts
  • Books
  • Blockchain
  • Computer Science
    • Data, Information, Knowledge
    • Theory of computation
    • Compilers and Interpreters
    • Graphics
    • Data Visualization
    • Parsers
    • Data Structures & Algorithms
    • Finite State Transducer
  • Cyberspace
    • Digital Governance
    • ą“•ąµ‡ą“°ą“³ą“¤ąµą“¤ą“æąµ½
    • Online Abuse
  • Databases
  • Education
    • Finite State Transducers
    • Digital Education
    • Digital Literacy
      • ą“”ą“æą“œą“æą“±ąµą“±ąµ½ ą“øą“¾ą“•ąµą“·ą“°ą“¤ą“¾ ą“Ŗą“¦ąµą“§ą“¤ą“æ
      • Resources
    • Remote Learning
    • General Learning
  • Entertainment
  • Frontend technology
    • Colors
    • Design systems
    • CSS
    • PWA
    • SPA
    • Vue
  • Generative Graphics
    • Drawbot
    • Matrix Digital Rain
  • Hardware
  • Internet
    • Etiquettes
    • Privacy
    • IPFS
    • Resilience
    • Decentralization
    • Network debugging tools
  • Knowledge Representation
  • Languages & Scripts
    • Arabic
    • Vattezhuth
  • Life
    • Digital Minimalism
  • Linux
  • Machine learning
    • Neural Networks
    • Dialog systems, Information retrieval
    • Large Language Models
    • Embedding
    • ML in Production
    • Retrieval Augmented Generation
  • Mathematics
  • Music
  • Parenting
  • Politics
    • Hatred, Hinduthwa, Nationalism
  • Productivity
  • Problem Solving
  • Science
  • Software Libraries
  • Software Engneering
    • Architecture
    • Product Management
    • Docker
    • Programming
      • Javascript
    • People
    • Performance
    • Code Review
  • Web3
  • Web Typography
  • Writing
  • ą“Ŗą“¾ą“Ÿąµą“Ÿąµą“•ąµ¾
    • ą“•ąµą“Ÿąµą“Ÿą“æą“Ŗąµą“Ŗą“¾ą“Ÿąµą“Ÿąµą“•ąµ¾
  • ą“®ą“²ą“Æą“¾ą“³ą“‚ ą“…ą“šąµą“šą“Ÿą“æ
  • ą“—ą“µąµ‡ą“·ą“£ą“Ŗąµą“°ą“¬ą“Øąµą“§ą“™ąµą“™ąµ¾
Powered by GitBook
On this page
  1. Malayalam Computing
  2. Spellcheck

Morphology analyser based approach

PreviousNature of Malayalam spelling mistakesNextTools and services

Last updated 4 years ago

To check if a word is valid, known, correctly spelled word, a simple look up using morphology analyser is enough. If the morphology analyser can parse the word, it is correctly spelled. Note that the word can be an agglutinated at arbitrary levels and inflected at same time.

To provide spelling suggestions, can be used. This is a three step process

  1. Generate a list of candidate words from the query word. The words in this list may be incorrect too. The words are generated based on the patterns we defined based on the nature of spelling mistakes. We scan the query word for common patterns of errors and apply fix for that pattern. Since there dozens of patterns, we will have many candidate words.

  2. From the candidate list, find out the correctly spelled word using spellcheck method. This will result a very small number of words. These words are the probable replacements for the misspelled query word.

  3. Sort the candidate words to provide more probable suggestion as the first one. For this, we can do a ranking on the suggestion strategies. A very common error pattern get high priority at step 1. So the suggestions from that appear first in the candidate list. A more sophisticated approach would use a frequency model for the words. So candidate words that are very frequent in the language will appear as first candidate.

One thing I observed from the above approach is, in reality the candidate words after all the above steps for Malayalam is most of the time one or two. This make step 3 less relevant. At the same time, an edit distance based approach would have generated more than 5 candidate words for each misspelled word. The candidates from the edit distance based suggestion mechanism would be very diverse, meaning, they won’t have be related to the indented word at all. The following images illustrates the difference.

Out of lexicon words

As part of the Morphology analyser, the expansion of the lexicon is a never ending task. As the lexicon grows, the spellchecker improves automatically.

Compared to the finite set word list, the FST based morphology analyser and generator system covers large number of words using its generation system based on morpho-phonotactics. For a discussion on this see my previous Since every language vocabulary is a dynamic system, it is still impossible to cover 100% words in a language all the time. New words get added to language every now and then. There are nouns related to places, people names, product names etc that is not in the lexicon of Morphology analyser. So, these words will be reported as unknown words by the spellchecker. Unknown word is interpreted as misspelled word too. This issue is a known problem. But since a spellchecker is often used by a human user, the severity of the issue depends whether the spellchecker does not know about lot of commonly used words or not. Most of the spellcheckers provide an option to add to dictionary to avoid this issue.

Finite-State Spell-Checking with Weighted Language and Error Models—Building and Evaluating Spell-Checkers with Wikipedia as Corpus Tommi A Pirinen, Krister Lindén [] This paper outlines the usage of Finite state transducer technique to address the issue of infinite dictionary of morphologically rich languages. They use Finnish as the example language

blog post about the coverage test.
Morphology Analysis
pdf
the FST based morphology analyser
Spelling suggestion based on morphology analysis approach
Spelling suggestions from edit distance based candidates