
Low resource languages


Last updated 4 years ago

The paper "Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets" analyses parallel language corpus datasets that are popular in the machine learning domain. It argues that the quality of these datasets is poor for low-resource languages, and it raises many questions about their use and about reported results that are based on this poor data. To measure quality, the authors recruited volunteers who speak many different languages. CCAligned, ParaCrawl and WikiMatrix are some of these popular datasets. WikiMatrix is a dataset built from Wikipedia by aligning sentences from articles in multiple languages. The paper reports that "two-thirds of the audited samples were on average misaligned. We noticed that sentences were often similar in structure, but described different facts" - which is not a surprise, given that Wikipedia articles are rarely exact translations of each other.

The authors use the term "representation washing" to describe a problem with these low-quality datasets for low-resource languages. I liked the term.

"Since there are datasets which contain many low-resource languages, the community may feel a sense of progress and growing equity, despite the actual quality of the resources for these languages. Similarly, if low-quality datasets are used as benchmarks they may exaggerate model performance, making low-resource NLP appear more solved than it is - or conversely, if models perform poorly when trained with such data, it may be wrongly assumed that the task of learning models for these languages is harder than it actually is or infeasible given current resources. These effects could result in productive effort being redirected away from these tasks and languages."

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets (arXiv.org)