
Part of speech tagging


Last updated 4 years ago

Identifying the part of speech, or grammatical category, of a word is one of the fundamental requirements for higher-level analysis of text. In the sentence “She lives at Palakkad”, identifying ‘she’ as a pronoun, ‘lives’ as a verb with a specific tense, and ‘Palakkad’ as a proper noun, specifically a place name, is crucial to understanding the semantics of the text. There are rule-based and statistical approaches for identifying these categories. We will not discuss those methods in this article, but once that identification is done, the result is the text with each word annotated with a tag. For example, here is a POS-tagged sentence in English:

There/EX are/VBP 70/CD children/NNS there/RB

Here, EX, VBP, CD, NNS and RB are POS tags. Specifically, these are tags defined in the Penn Treebank POS tagset, which has 45 tags and has been used to label many English corpora.

There are alternative tagsets, such as the Brown tagset, which defines 87 tags for English. The members of a tagset are chosen based on the characteristics of the language and on how detailed an analysis is required. For example, the Penn tagset uses IN both for subordinating conjunctions like if, when, unless and after, and for prepositions like in, on and after. A different tagset may define separate tags for them, making it possible to tell them apart.
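The word/TAG notation used in the English example is easy to process mechanically. Here is a minimal sketch in Python; the function name parse_tagged is made up for illustration and does not come from any tagging library:

```python
def parse_tagged(sentence):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in sentence.split():
        # rpartition splits on the last slash, so a word that itself
        # contains a slash keeps everything before the final one.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged("There/EX are/VBP 70/CD children/NNS there/RB")
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

Note that the two occurrences of "there" get different tags (EX and RB), which is exactly the kind of distinction a tagset exists to record.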

POS tagging for morphologically rich languages

Languages with rich morphology require more complex tagging schemes and methods. Malayalam is one such language, as are many of the Dravidian languages, Turkish, Hungarian, Finnish, Czech and many others. A word in a morphologically rich language carries more information than a word in a language like English. If the word is agglutinated and inflected, it contains multiple words along with inflection information. Since POS tagging is the basis for higher-level information processing, extracting as much information as possible from the word is important.

Language | Corpus size | Unique words
English | 10 million | 97,734
Turkish | 10 million | 417,775
Malayalam | 10 million | 1,427,392
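As a quick check on the figures in the table, the vocabulary of each corpus can be expressed relative to English. The sketch below hard-codes the three counts from the table. The helper type_token_counts shows how such counts could be obtained from raw text, assuming simple whitespace tokenization, which is a simplification of what a real corpus study would do:

```python
# Unique word counts for 10 million word corpora, copied from the table.
unique_words = {
    "English": 97_734,
    "Turkish": 417_775,
    "Malayalam": 1_427_392,
}


def vocabulary_ratio(counts, baseline="English"):
    """Return each vocabulary size as a multiple of the baseline language."""
    base = counts[baseline]
    return {lang: round(n / base, 1) for lang, n in counts.items()}


def type_token_counts(text):
    """Count tokens and distinct word types using whitespace tokenization."""
    tokens = text.split()
    return len(tokens), len(set(tokens))


ratios = vocabulary_ratio(unique_words)
print(ratios)
```

The ratios come out at roughly 4 for Turkish and 14 for Malayalam, in line with the comparison made in the text.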

In English, a lot of information about the syntactic function of a word is represented by word order or neighboring function words. For example, in the phrase at Palakkad, the word at and its position in the sentence give the place name Palakkad its locative sense. If we consider the same phrase in Malayalam, പാലക്കാട്ടിൽ, the word പാലക്കാട് is inflected (locative) and contains the whole information. Identifying പാലക്കാട്ടിൽ just as a proper noun is not sufficient. The nominal inflection, which is locative here, should also be identified.

For this reason, tagging systems for agglutinative, inflectional languages are more sophisticated and use tagsets larger than the 50-100 tags we have seen for English. The general practice is to use a sequence of tags rather than a single primitive tag. An example from (Hakkani-Tür et al., 2002):

  • Sentence: Yerdeki izin temizlenmesi gerek.

  • English: The trace on the floor should be cleaned

  • POS tagging for izin: iz+Noun+A3sg+Pnon+Gen
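A tag sequence like the one above is simply a root followed by features joined with "+". The following sketch splits such an analysis; the function name parse_analysis is hypothetical, and the root iz ("trace") is my reading of the example:

```python
def parse_analysis(analysis):
    """Split a 'root+Feature+Feature...' analysis into a root and its features."""
    # Strip each part so stray spaces around '+' do not end up in the result.
    root, *features = [part.strip() for part in analysis.split("+")]
    return root, features


root, features = parse_analysis("iz+Noun+A3sg+Pnon+Gen")
print(root)      # the stem of the word
print(features)  # the category and the inflection features, in order
```

Keeping the features as an ordered list matters: in a sequence-based scheme the tag is the whole sequence, not any single element of it.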

A morphology analyser is used for this tagging. The tagsets for these languages are huge. The morphologically analysed and tagged MULTEXT-East corpora in English, Czech, Estonian, Hungarian, Romanian and Slovene (Dimitrova et al., 1998; Erjavec, 2004; Hajic, 2000) give the following tagset sizes:

Language | Tagset size
English | 139
Czech | 970
Estonian | 476
Hungarian | 401
Romanian | 486
Slovene | 1033

mlmorph tagset

BIS POS tagset

  1. Noun: 3 tags are defined, for common noun, proper noun and locative inflection: NN, NNP and NST. There is no tag to differentiate a singular noun from a plural noun, and there are no tags for gender either. I am not sure why locative is defined while the accusative, dative, genitive, instrumental, sociative and vocative nominal inflections are omitted. The document does not give many examples for NST either.

  2. Pronoun: 5 tags are defined for Personal pronoun(PRP), Reflexive Pronoun(PRF), Relative(PRL), Reciprocal(PRC) and Pronoun question word(PRQ)

  3. Demonstrative: 3 tags are defined: Deictic(DMD), Relative(DMR), Demonstrative question(DMQ). It should be noted that Malayalam has demonstrative prefixing, “ചുട്ടെഴുത്തുകൾ”. For example, അക്കാര്യം, ഇക്കാര്യം and അക്കാണുമ്മാമലയൊന്നും show this demonstrative prefixing. The document does not discuss them.

  4. Verb: Verbs are divided into Main, Verbal(VN) and Auxiliary(VAUX) categories. A main verb can be Finite(VF), Non-Finite(VNF) or Infinitive(VINF). In total there are 6 tags. Obviously, tense information is not captured. Verbs in Malayalam are inflected for tense, mood, voice and aspect. They are inflected for present, past and future tenses. Perfect, habitual and iterative aspects are very common, and the iterative aspect has tense and emphatic variations. Verbs are inflected for causative and passive voices as well. A variety of mood forms exist, such as abilitative, imperative, compulsive, promissive, optative, purposive, permissive, precative, irrealis, monitory, conditional and satisfactive. All of these forms are supported by Mlmorph. This is a serious omission in the BIS POS set. If you are not able to extract at least tense information, I don’t know how useful POS tagging is.

  5. Adjective: The JJ tag is used here. Here too, the agglutinative nature of Malayalam adjectives is not addressed. Consider നീലത്താമര, where നീല is an adjective of താമര: a perfect case where you need a sequence of POS tags, as we discussed earlier. A related characteristic of Malayalam, coordinatives (ദ്വന്ദസമാസം), is missed here. For example, in the word അച്ചനമ്മമാർ, it is tricky to avoid interpreting അച്ഛൻ as an adjective of അമ്മ.

  6. Adverb: The RB tag is used here, which seems to be copied directly from the Penn POS set.

  7. Postposition: PSP tag is defined here.

  8. Conjunction: Coordinator(CCD), Subordinator(CCS) and Quotative(UT) are defined here. Examples are confusing.

  9. Particles: Default(RPD), Classifier(C), Interjection(INJ), Negation(NEG) are defined here. Curiously Affirmative is missing when NEG is present.

  10. Residuals: Foreign words(RDF), Symbol(SYM), Punctuation(PUNC), Unknown words(UNK) and Echowords(ECH) are defined in this section. There is no explanation or example of what is meant by Echowords. Judging from the examples given, symbols and punctuation are more or less the same. Since there is no example or explanation for Foreign word, I am not sure whether it means, for example, English words written in English, or words that originated from other languages such as Sanskrit. Mlmorph has a sanskrit tag for when the morpheme of the word is from Sanskrit. Knowing this is important, since such words have completely different agglutination rules. For example, compare ആശാതീരം with കടൽത്തീരം: the ആശ->ആശാ adjective form is of Sanskrit origin.

General comments

  1. A total of 36 tags is defined for Malayalam, while even a morphologically poor language like English has at least 45 in the Penn tagset. The tagset does not take any language characteristics into consideration.

  2. A lot of word information cannot be captured because of missing tags. Even the Penn POS tagset has tense information.

  3. Poor documentation and examples.

  4. The document does not discuss the morphology of the languages, and gives no detail on how the rich morphology of Malayalam is addressed in this tagging system.

  5. Sequential tagging is not discussed at all.

  6. One of the worst Malayalam fonts is used, with a lot of rendering mistakes, which adds to the confusion in the examples.

In general, the BIS tagset is incomplete for Malayalam. This is even more obvious from the example tagging given in the same document.

Malayalam tagging examples from BIS POS tag document

  1. Let us list the words that are tagged as N_NN (Common noun): പട്ടണങ്ങൾ, പുണ്യനഗരികൾ, പുണ്യസ്ഥലങ്ങൾക്ക്, പുണ്യസ്ഥലങ്ങളുടെ, സ്ഥലങ്ങൾക്കും, ധർമ്മസ്ഥലങ്ങളും, തീർത്ഥാടനസ്ഥലങ്ങളും, മോക്ഷം, ഹിന്ദു, മഹത്വം, ശ്രേഷ്ഠതയും, ആദരവും, ഗ്രന്ഥങ്ങളിൽ. It is obvious that tagging all of these as N_NN is very generic. We lose plurals, inflections, adjectives, conjunctions and much more information.

  2. ആണ് is tagged as auxiliary verb V_AUX, while its use here is affirmative.

  3. മോക്ഷപ്രദായകമാണെന്ന്: this is a good example. You can see the agglutination of 4 words (മോക്ഷം, പ്രദായകം, ആണ്, എന്ന്); tagging all of them together as Common Noun is of no use.

Now I will attempt to prove my observation by actually using a corpus that is tagged using the above tag system and provided by TDIL.

Malayalam Monolingual Text Corpus ILCI-II

Let us take a tagged sentence for analysis:

YACD54	കരീമിന്റെ\N_NNP `\RD_PUNC പറയാന്‍\N_NNP ബാക്കിവെച്ചത്\N_NNP
`\RD_PUNC ,\RD_PUNC അനില്‍\N_NNP തോമസിന്റെ\N_NNP
`\RD_PUNC മരം\N_NNP പെയ്യുമ്പോള്‍\N_NNP `\RD_PUNC എന്നിവ\N_NN
പ്രദര്‍ശനത്തിന്\N_NN തയ്യാറായി\RB നില്‍ക്കുന്നു\V_VM_VF .\RD_PUNC

The above sentence is from mal_art and culture_set1.txt in the corpus. YACD54 is the sentence ID.

  1. Words that are tagged as Proper Noun(N_NNP): കരീമിന്റെ, പറയാന്‍, ബാക്കിവെച്ചത്, അനില്‍, മരം, പെയ്യുമ്പോള്‍. Here പറയാന്‍, ബാക്കിവെച്ചത് and പെയ്യുമ്പോള്‍ are verbs or verb-derived words. They should never be tagged as nouns.

  2. Words that are tagged as Common noun(N_NN): പ്രദര്‍ശനത്തിന്, എന്നിവ. None of them are nouns.

  3. Words that are tagged as Adverb: തയ്യാറായി.

  4. Words that are tagged as Finite verbs: നില്‍ക്കുന്നു

If I understood correctly, this is a manually tagged corpus. And as we see, excluding punctuation, I would say 3 out of 11 words are tagged almost correctly: അനിൽ, തയ്യാറായി, നില്‍ക്കുന്നു.

I will list a few more samples for your analysis.

MYGD42	വാല്മീകി\N_NNP രാമായണത്തിലും\N_NNP ഭാസന്റെ\N_NNP കൃതികളിലും\N_NN
എല്ലാം\QT_QTF ഇവിടുത്തെ\N_NST പർവ്വതങ്ങളെ\N_NN പരാമർശിച്ചിരിക്കുന്നത്\V_VM_VNF
കാണാം\V_VM_VNF .\RD_PUNC
MYLTD52 പായ\N_NN കെട്ടിയ\JJ വലിയ\JJ വഞ്ചികളും\N_NN മീന്‍\N_NN പിടിക്കുന്ന\V_VM_VNF
കൊച്ചുതോണികളും\N_NN പോകാന്‍\V_VM_VNF തുടങ്ങും\V_VM_VNF .\RD_PUNC

Conclusion

Malayalam is a morphologically rich language and requires a sequence-based POS tagging system with a wide set of POS tags and feature tags. A smaller POS tagging system like the BIS POS tagging system does not address the characteristics of the language. The POS tagset itself is incomplete and was not prepared with enough detail. Using such a tag system will miss most of the important POS information required for higher-level processing. The tagging examples given in the POS tag document and the corpus provided by TDIL are full of mistakes and make me wonder whether they went through any review at all. I would not advise using that corpus for statistical training or for reference purposes.

Even though I used Malayalam as the example, the BIS tagset has the same tags for other languages as well. I would argue that those languages also face more or less the same issues I explained in this article.

Thanks for reading!

References

  1. Oravecz, C. and Dienes, P. (2002). Efficient stochastic part-of-speech tagging for Hungarian. In LREC-02, Las Palmas, Canary Islands, Spain, pp. 710–717.

  2. Hakkani-Tür, D., Oflazer, K., and Tür, G. (2002). Statistical morphological disambiguation for agglutinative languages. Journal of Computers and Humanities, 36(4), 381–410.

  3. Daniel Jurafsky, James H Martin. Speech and Language Processing. Second edition. Chapter 5.

To understand the productive word formation in a morphologically rich language compared to English, a corpus analysis can be used. A 250,000-token corpus of Hungarian has more than twice as many word types as a similarly sized corpus of English (Oravecz and Dienes, 2002). A 10-million-word corpus of Turkish has 4 times as many unique words as a similarly sized English corpus (Hakkani-Tür et al., 2002). A 10-million-word corpus of Malayalam has 14 times as many unique words as a similarly sized English corpus, as calculated from the SMC Corpus.

The Universal Dependencies project, which defines 17 POS tags and an extensive set of feature tags to tag any language, is worth mentioning here. Mlmorph uses the tagset from Universal Dependencies.

Mlmorph uses a sequence-based tagset. Currently there are 87 tags; you can refer to them here: https://gitlab.com/smc/mlmorph/blob/master/tags.json. The word പാലക്കാട്ടിൽ will be analysed as പാലക്കാട്<np><locative>. Similarly, തിരുവനന്തപുരവുമാണ് will be tagged as തിരുവനന്തപുരം<np>ഉം<cnj>ആണ്<aff>. As you can see, we are extracting the maximum information out of the words for higher-level processing. The number of unique POS tag sequences is not finite.

The BIS POS tagset attempts to define a common tagset for all Indian languages. I will focus on Malayalam here, but the tagset is mostly the same for other languages too. The tags are defined in 11 categories.

Quantifiers: General(QTF), Cardinals(QTC), Ordinals(QTO) are defined. I have written extensively on why Malayalam needs a large tagset for numbers in my article about number spellout. Malayalam numbers are spelled using agglutinated words, and it is important to recognize the digits and place values from them.

Under the Indian Languages Corpora Initiative Phase-II (ILCI Phase-II) project, initiated by MeitY, Govt. of India, Jawaharlal Nehru University, New Delhi collected a monolingual corpus in Malayalam. This is the final outcome of the project, and it contains approx. 31,000 sentences from the general domain. It uses the BIS tag system. The corpus is available for download from the TDIL website, but the process is not straightforward. To download the complete corpus, you need to register on the site, fill in a form, sign it and send the physical copy by post to TDIL to get a download link. The corpus has very restrictive terms of use: you can only use it for research. The same site also provides a sample version of the corpus, which has about 30% of the original corpus. For my analysis, I used that smaller version.

Universal Dependencies (UD) is a framework for consistent annotation of grammar (parts of speech, morphological features, and syntactic dependencies) across different human languages. UD is an open community effort with over 300 contributors producing more than 150 treebanks in 90 languages.
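The Mlmorph analyses quoted above are strings in which each morpheme is followed by one or more tags in angle brackets. The sketch below splits such a string into (morpheme, tags) pairs. It is plain string handling written for this article's examples, not the mlmorph API, and the function name parse_mlmorph is made up:

```python
import re

# One segment: a run of non-bracket characters (the morpheme) followed by
# one or more <tag> groups.
SEGMENT = re.compile(r"([^<>]+)((?:<[^<>]+>)+)")
TAG = re.compile(r"<([^<>]+)>")


def parse_mlmorph(analysis):
    """Turn 'morpheme<tag><tag>morpheme<tag>' into [(morpheme, [tags]), ...]."""
    return [
        (morpheme, TAG.findall(tags))
        for morpheme, tags in SEGMENT.findall(analysis)
    ]


locative = parse_mlmorph("പാലക്കാട്<np><locative>")
compound = parse_mlmorph("തിരുവനന്തപുരം<np>ഉം<cnj>ആണ്<aff>")
print(locative)
print(compound)
```

The second example yields three segments from a single written word, which is the information a single primitive tag such as N_NNP cannot carry.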