Embedding

  • A Visual Guide to FastText Word Embeddings https://amitness.com/2020/06/fasttext-embeddings/

  • What Is FastText? Compared To Word2Vec & GloVe [How To Tutorial In Python] https://spotintelligence.com/2023/12/05/fasttext/ (the subword n-gram idea both links cover is sketched in code below)
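Both links walk through the same core mechanism: FastText represents a word as the sum of vectors for its character n-grams (with < and > as boundary markers), which is why it can embed out-of-vocabulary words. A minimal sketch of that composition step, assuming a toy ngram_vectors lookup rather than a trained model and real FastText's hashed bucket table:

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams as used by FastText: the word is wrapped in
    '<' and '>' boundary markers and all n-grams of length 3..6 are taken."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(n_min, n_max + 1):
        for i in range(len(wrapped) - n + 1):
            grams.add(wrapped[i:i + n])
    grams.add(wrapped)  # the full word (with markers) is kept as its own unit
    return grams

def word_vector(word, ngram_vectors, dim=100):
    """Toy composition: the word vector is the sum of its n-gram vectors.
    ngram_vectors is a hypothetical dict {ngram: np.ndarray}; real FastText
    hashes the n-grams into a fixed-size bucket table instead."""
    vec = np.zeros(dim)
    for g in char_ngrams(word):
        if g in ngram_vectors:
            vec += ngram_vectors[g]
    return vec

# Out-of-vocabulary words still get a vector because they share n-grams
# ("<un", "unh", ...) with words seen during training.
print(sorted(char_ngrams("where"), key=len)[:5])
```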

Morphology and word embeddings

MorphPiece: A Linguistic Tokenizer for Large Language Models https://arxiv.org/pdf/2307.07262.pdf Tokenization is a critical part of modern NLP pipelines. However, contemporary tokenizers for Large Language Models are based on statistical analysis of text corpora, without much consideration of linguistic features. I propose a linguistically motivated tokenization scheme, MorphPiece, which is based partly on morphological segmentation of the underlying text. A GPT-style causal language model trained on this tokenizer (called MorphGPT) shows comparable or superior performance on a variety of supervised and unsupervised NLP tasks, compared to the OpenAI GPT-2 model. Specifically, I evaluated MorphGPT on language modeling tasks, zero-shot performance on the GLUE Benchmark with various prompt templates, the Massive Text Embedding Benchmark (MTEB) for supervised and unsupervised performance, and lastly against another morphological tokenization scheme (FLOTA, Hofmann et al., 2022), and found that the model trained on MorphPiece outperforms GPT-2 on most evaluations, at times by a considerable margin, despite being trained for about half the training iterations.
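The sketch below is not MorphPiece's actual algorithm, just the general morphology-first pattern the abstract describes: emit morphemes when a morphological segmentation is available, otherwise fall back to a statistical subword tokenizer. The morpheme_lexicon and bpe_fallback here are hypothetical stand-ins.

```python
def morph_first_tokenize(word, morpheme_lexicon, bpe_fallback):
    """Morphology-first tokenization sketch (not MorphPiece's exact algorithm):
    if the word has a known morphological segmentation, emit its morphemes;
    otherwise defer to a statistical subword tokenizer such as BPE.

    morpheme_lexicon: hypothetical dict mapping surface forms to morpheme lists,
                      e.g. {"unhappiness": ["un", "happi", "ness"]}
    bpe_fallback:     any callable str -> list[str], e.g. a trained BPE encoder
    """
    if word in morpheme_lexicon:
        return morpheme_lexicon[word]
    return bpe_fallback(word)

# Toy usage with stand-in components.
lexicon = {"unhappiness": ["un", "happi", "ness"], "walked": ["walk", "ed"]}
naive_bpe = lambda w: [w[i:i + 3] for i in range(0, len(w), 3)]  # placeholder only
print(morph_first_tokenize("unhappiness", lexicon, naive_bpe))  # ['un', 'happi', 'ness']
print(morph_first_tokenize("zxqv", lexicon, naive_bpe))          # falls back to the BPE stand-in
```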

Impact of Tokenization on Language Models: An Analysis for Turkish https://dl.acm.org/doi/10.1145/3578707 Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, in which many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, that is, their outputs vary from the smallest pieces of characters to the surface form of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the morphological-level tokenizer performs competitively with the de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological- and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.
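The 20% / 40% recommendation at the end of the abstract is easy to turn into a sizing check: the embedding table contributes vocab_size × hidden_size parameters, and the ratio is taken against the whole model. The numbers below are illustrative assumptions, not figures from the paper.

```python
def vocab_param_ratio(vocab_size, hidden_size, non_embedding_params):
    """Share of parameters spent on the (tied) embedding matrix.
    non_embedding_params: all transformer weights excluding the embedding table."""
    vocab_params = vocab_size * hidden_size
    total_params = vocab_params + non_embedding_params
    return vocab_params / total_params

# Illustrative medium-sized model (hypothetical numbers, not from the paper):
hidden = 768
body = 85_000_000  # transformer body, excluding embeddings

for vocab in (16_000, 32_000, 64_000, 128_000):
    r = vocab_param_ratio(vocab, hidden, body)
    print(f"vocab={vocab:>7}: vocab params make up {r:.0%} of the model")
# Sweeping the vocabulary size like this shows which setting lands near the
# ~20% (subword) or ~40% (morphological/word-level) ratios the paper suggests.
```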

Superbizarre Is Not Superb: Derivational Morphology Improves BERT’s Interpretation of Complex Words - https://arxiv.org/pdf/2101.00403.pdf

How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese https://aclanthology.org/2023.acl-srw.5.pdf

FLOTA: An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers (Hofmann et al., 2022). Code: https://github.com/valentinhofmann/flota/
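As I understand FLOTA (Few Longest Token Approximation), the core idea is to keep the longest vocabulary substring of a word intact and recurse on the remainder, instead of splitting greedily left to right. A simplified sketch under that assumption (not the repository's exact implementation):

```python
def flota_like_tokenize(word, vocab, k=3):
    """Simplified sketch of the FLOTA idea (not the repo's exact code):
    greedily keep the longest vocabulary substring of the word, blank it out,
    and repeat up to k times, so that long morpheme-like pieces survive
    instead of being fragmented left-to-right as plain BPE/WordPiece would do."""
    filler = "\0"
    pieces = []  # (start_position, subword)
    current = word
    for _ in range(k):
        best = None
        for length in range(len(current), 0, -1):          # longest candidates first
            for start in range(len(current) - length + 1):
                cand = current[start:start + length]
                if filler not in cand and cand in vocab:
                    best = (start, cand)
                    break
            if best:
                break
        if not best:
            break
        start, cand = best
        pieces.append(best)
        current = current[:start] + filler * len(cand) + current[start + len(cand):]
    return [sub for _, sub in sorted(pieces)]

vocab = {"super", "bizarre", "vision", "di", "ion"}
print(flota_like_tokenize("superbizarre", vocab))  # ['super', 'bizarre']
```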

Semantic properties of English nominal pluralization: Insights from word embeddings https://arxiv.org/abs/2203.15424

A generating model for Finnish nominal inflection using distributional semantics https://osf.io/preprints/psyarxiv/ndtv2

Cross-Lingual Word Embeddings for Morphologically Rich Languages https://aclanthology.org/R19-1140.pdf

Characters or Morphemes: How to Represent Words? https://aclanthology.org/W18-3019/ In this paper, we investigate the effects of using subword information in representation learning. We argue that using syntactic subword units affects the quality of the word representations positively. We introduce a morpheme-based model and compare it against word-based, character-based, and character n-gram level models. Our model takes a list of candidate segmentations of a word and learns the representation of the word based on different segmentations that are weighted by an attention mechanism. We performed experiments on Turkish, a morphologically rich language, and English, which has comparably poorer morphology. The results show that morpheme-based models are better at learning word representations of morphologically complex languages compared to character-based and character n-gram level models, since the morphemes help to incorporate more syntactic knowledge in learning, which makes morpheme-based models better at syntactic tasks.
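The attention-over-segmentations idea is easy to sketch: embed each candidate segmentation (here as the mean of its morpheme vectors) and let an attention weighting decide how much each candidate contributes to the final word vector. The code below uses random toy vectors and a stand-in query, not the paper's trained model:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Hypothetical setup: a tiny morpheme embedding table and two candidate
# segmentations of the Turkish word "evlerimizden" ("from our houses").
dim = 8
morphemes = ["ev", "ler", "imiz", "den", "evler", "imizden"]
emb = {m: rng.normal(size=dim) for m in morphemes}

candidates = [
    ["ev", "ler", "imiz", "den"],   # fine-grained segmentation
    ["evler", "imizden"],           # coarser segmentation
]

# Each candidate segmentation is embedded as the mean of its morpheme vectors.
cand_vecs = np.stack([np.mean([emb[m] for m in seg], axis=0) for seg in candidates])

# Attention: a query vector scores each candidate; the word vector is the
# attention-weighted sum. (In the paper the weights are learned end-to-end;
# here the query is just a random stand-in.)
query = rng.normal(size=dim)
weights = softmax(cand_vecs @ query)
word_vec = weights @ cand_vecs

print("attention over segmentations:", np.round(weights, 3))
print("word vector shape:", word_vec.shape)
```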

Morphological Word-Embeddings

https://aclanthology.org/N15-1140.pdf

Linguistic similarity is multi-faceted. For instance, two words may be similar with respect to semantics, syntax, or morphology inter alia. Continuous word-embeddings have been shown to capture most of these shades of similarity to some degree. This work considers guiding word-embeddings with morphologically annotated data, a form of semi- supervised learning, encouraging the vectors to encode a word’s morphology, i.e., words close in the embedded space share morphological features. We extend the log-bilinear model to this end and show that indeed our learned embeddings achieve this, using German as a case study.
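One concrete way to read "guiding word-embeddings with morphologically annotated data": besides the usual next-word prediction of the log-bilinear model, the context representation also has to predict the morphological tag of the target word. The sketch below is my own minimal rendering of such a joint objective with random toy parameters, not the paper's exact model:

```python
import numpy as np

rng = np.random.default_rng(1)
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Toy sizes (assumptions, not from the paper).
V, T, D, ctx = 50, 6, 16, 3          # vocab, morph tags, embedding dim, context length
R = rng.normal(0.0, 0.1, (V, D))     # input (context) word embeddings
Q = rng.normal(0.0, 0.1, (V, D))     # output word embeddings
S = rng.normal(0.0, 0.1, (T, D))     # output morphological-tag embeddings
C = [np.eye(D) for _ in range(ctx)]  # per-position combination matrices (LBL-style)

context_ids = [3, 17, 42]            # previous words
target_word, target_tag = 7, 2       # next word and its annotated morph tag

# Log-bilinear context representation: position-weighted sum of context vectors.
h = sum(Ci @ R[wi] for Ci, wi in zip(C, context_ids))

# Joint objective: predict the next word AND its morphological tag from h.
p_word = softmax(Q @ h)
p_tag = softmax(S @ h)
lam = 0.5                            # weight of the morphology term (assumption)
loss = -np.log(p_word[target_word]) - lam * np.log(p_tag[target_tag])
print(f"joint loss: {loss:.3f}")
# Because h must also be predictive of morphological tags, training pushes words
# that share morphological features toward nearby regions of the embedding space.
```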

Improve word embedding using both writing and pronunciation

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0208785

This paper proposes the concept of a pronunciation-enhanced word embedding model (PWE) that integrates pronunciation information into training so that both speech and writing contribute to word meaning. The paper uses Chinese, English, and Spanish as examples and presents several models that integrate word pronunciation characteristics into word embedding. Word similarity and text classification experiments show that the PWE outperforms the baseline model that does not include speech information. Language is a storehouse of sound-images; therefore, the PWE can be applied to most languages.
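A hedged sketch of the idea: each word gets one vector from its written form and one from its pronunciation, and the two are fused into a single embedding. The lookup tables below are hypothetical and the fusion is a simple weighted concatenation; the paper trains the two channels jointly rather than mixing fixed vectors.

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 8

# Hypothetical lookup tables: one keyed by written form, one by pronunciation.
writing_emb = {"猫": rng.normal(size=dim), "cat": rng.normal(size=dim)}
pronunciation_emb = {"māo": rng.normal(size=dim), "/kæt/": rng.normal(size=dim)}

def pronunciation_enhanced_vector(written, spoken, alpha=0.5):
    """Fuse writing and pronunciation into one word vector.
    Shown here as a weighted concatenation; the PWE paper learns the two
    channels jointly during training rather than mixing fixed vectors."""
    w = writing_emb[written]
    p = pronunciation_emb[spoken]
    return np.concatenate([alpha * w, (1 - alpha) * p])

vec = pronunciation_enhanced_vector("猫", "māo")
print(vec.shape)  # (16,) — writing half followed by pronunciation half
```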

Word Ordering: Two/Too Simple Adaptations of Word2Vec for Syntax Problems

https://aclanthology.org/N15-1142/

We present two simple modifications to the models in the popular Word2Vec tool, in order to generate embeddings more suited to tasks involving syntax. The main issue with the original models is that they are insensitive to word order. While order independence is useful for inducing semantic representations, it leads to suboptimal results when the embeddings are used to solve syntax-based problems. We show improvements in part-of-speech tagging and dependency parsing using our proposed models.
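The structured skip-gram variant from this paper makes the model order-aware by giving every relative context position its own output parameters instead of one shared output matrix. A small numpy sketch of one prediction step under that scheme (toy sizes, random weights, no training):

```python
import numpy as np

rng = np.random.default_rng(3)
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

V, D, window = 100, 16, 2
W_in = rng.normal(0.0, 0.1, (V, D))

# Standard skip-gram has ONE output matrix shared by all context positions.
W_out_shared = rng.normal(0.0, 0.1, (V, D))

# Structured skip-gram keeps a SEPARATE output matrix per relative position
# (-window .. -1, +1 .. +window), so word order influences the predictions.
positions = [p for p in range(-window, window + 1) if p != 0]
W_out_pos = {p: rng.normal(0.0, 0.1, (V, D)) for p in positions}

center = 42
h = W_in[center]

for p in positions:
    p_standard = softmax(W_out_shared @ h)    # same distribution for every position
    p_structured = softmax(W_out_pos[p] @ h)  # position-specific distribution
    print(f"position {p:+d}: top word (structured) = {int(np.argmax(p_structured))}, "
          f"(standard) = {int(np.argmax(p_standard))}")
```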

Compositional Morphology for Word Representations and Language Modelling

https://proceedings.mlr.press/v32/botha14.html
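Botha & Blunsom's central move is additive composition: a word vector is the sum of a word-specific factor and the factors of its morphemes, so rare inflected forms share most of their parameters with their stems. A toy sketch of that composition (hand-written segmentation and random vectors; the paper obtains segmentations automatically, e.g. with Morfessor, inside a log-bilinear language model):

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 8

# Hypothetical factor embeddings: one per surface word and one per morpheme.
factor_emb = {f: rng.normal(size=dim)
              for f in ["imperfection", "imperfections", "im", "perfect", "ion", "s"]}

def compose(word, morphemes):
    """Additive composition: word vector = word factor + sum of morpheme factors.
    Rare forms like 'imperfections' share most of their vector with 'imperfection'
    because the two differ only in the 's' morpheme factor."""
    return factor_emb[word] + sum(factor_emb[m] for m in morphemes)

v_sing = compose("imperfection", ["im", "perfect", "ion"])
v_plur = compose("imperfections", ["im", "perfect", "ion", "s"])

cos = v_sing @ v_plur / (np.linalg.norm(v_sing) * np.linalg.norm(v_plur))
print(f"cosine(imperfection, imperfections) = {cos:.2f}")
```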

References:

  • What are embeddings? https://vickiboykis.com/what_are_embeddings/next.html https://raw.githubusercontent.com/veekaybee/what_are_embeddings/main/embeddings.pdf
