# Large Language Models

* [Understanding Large Language Models](https://magazine.sebastianraschka.com/p/understanding-large-language-models) -- A Transformative Reading List. Sebastian Raschka's whole site is well worth reading; start with this survey of LLM posts and literature.
* [A Primer on Neural Network Models for Natural Language Processing](https://arxiv.org/abs/1510.00726) - It's a good idea to read everything Yoav Goldberg has written, but this is a great start.
* Numbers Every LLM Developer Should Know <https://github.com/ray-project/llm-numbers>
* Transformers from Scratch - This is the one I come back to every time. <https://e2eml.school/transformers.html>
* Illustrated Word2Vec - Jay Alammar's site is extremely good; this post is a particularly clear walkthrough of Word2Vec. <https://jalammar.github.io/illustrated-word2vec/>
* Attention? Attention! - Deep dive into the attention mechanism. <https://lilianweng.github.io/posts/2018-06-24-attention/>
* A History of NLP - Great summary of the field over the last 20 or so years.
* Dive into Deep Learning Course <https://d2l.ai/index.html>
* <https://arstechnica.com/science/2023/07/a-jargon-free-explanation-of-how-ai-large-language-models-work/>
* Indic-gemma-7b-Navarasa [Blog](https://ravidesetty.medium.com/introducing-indic-gemma-7b-2b-instruction-tuned-model-on-9-indian-languages-navarasa-86bc81b4a282), [Code](https://github.com/TeluguLLMLabs/Indic-gemma-7b-Navarasa)
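
Several of the links above (notably the attention deep dive and Transformers from Scratch) center on the same core operation. A minimal NumPy sketch of scaled dot-product attention, with illustrative shapes and names:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V -- the core of the transformer."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output row is a weighted average of the values

# Toy example: 2 queries attending over 3 key/value pairs of dimension 4
rng = np.random.default_rng(0)
Q = rng.standard_normal((2, 4))
K = rng.standard_normal((3, 4))
V = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # one 4-dimensional output per query
```

This is single-head attention without masking; the linked posts build up the multi-head and causal variants from this same primitive.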

<https://ig.ft.com/generative-ai/>

## RLHF

<https://towardsdatascience.com/rlhf-reinforcement-learning-from-human-feedback-faa5ff4761d1>
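
RLHF first fits a reward model to human preference comparisons, then optimizes the policy against that reward. A minimal sketch of the pairwise (Bradley-Terry) reward-model loss in plain Python; the function name and values are illustrative, not taken from any particular library:

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected).
    The loss is small when the reward model scores the human-preferred
    response above the rejected one, and large when it gets them backwards."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, 0.0))  # small: model agrees with the human label
print(preference_loss(0.0, 2.0))  # large: model disagrees
```

The trained reward model then scores policy samples during the RL step (commonly PPO), which the linked article walks through.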

> If we aim to match the performance of ChatGPT through open source, I believe we need to start taking training data more seriously. A substantial part of ChatGPT’s effectiveness might not come from, say, specific ML architecture, fine-tuning techniques, or frameworks. But more likely, it’s from the breadth, scale and quality of the instruction data.

> To put it bluntly, fine-tuning large language models on mediocre instruction data is a waste of compute. Let’s take a look at what has changed in the training data and learning paradigm—how we are now formatting the training data differently and therefore learning differently than in past large-scale pre-training.
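
The quote's point about formatting training data differently can be made concrete. A minimal sketch of instruction-style formatting, assuming an Alpaca-style prompt template (the exact template text varies by project and is an assumption here):

```python
# One common instruction-tuning layout (Alpaca-style). Unlike plain
# pre-training, where raw text is consumed as-is, each example is wrapped
# in a fixed template marking the instruction and the expected response.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def format_example(instruction: str, response: str) -> str:
    """Turn a raw (instruction, response) pair into one training string."""
    return TEMPLATE.format(instruction=instruction, response=response)

example = format_example("Translate 'hello' to French.", "Bonjour")
print(example)
```

At inference time the same template is used with the response left empty, so the model completes it; the quality and breadth of these pairs is exactly what the quote argues matters most.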

## Local-language LLMs

* Kannada LLAMA <https://www.tensoic.com/blog/kannada-llama/>
* Malaysian Mistral <https://github.com/mesolitica/research-paper/blob/master/malaysian-mistral.pdf>
* MaLLaM Malaysia Large Language Model <https://github.com/mesolitica/research-paper/blob/master/mallam.pdf> <https://huggingface.co/mesolitica/mallam-1.1B-4096>
* Tamil LLAMA <https://arxiv.org/abs/2311.05845> and later <https://abhinand05.medium.com/breaking-language-barriers-introducing-tamil-llama-v0-2-and-its-expansion-to-telugu-and-malayalam-deb5d23e9264>
* Introducing Airavata: Hindi Instruction-tuned LLM <https://ai4bharat.github.io/airavata/>
* Malayalam LLM <https://github.com/VishnuPJ/MalayaLLM>
* AYA <https://huggingface.co/CohereForAI/aya-101>

## Copyright

* OpenAI says it’s “impossible” to create useful AI models without copyrighted material <https://arstechnica.com/information-technology/2024/01/openai-says-its-impossible-to-create-useful-ai-models-without-copyrighted-material/> - Further, OpenAI writes that limiting training data to public domain books and drawings "created more than a century ago" would not provide AI systems that "meet the needs of today's citizens."
* <https://www.aisnakeoil.com/p/generative-ais-end-run-around-copyright> We don’t think the injustice at the heart of generative AI will be redressed by the courts. Maybe changes to copyright law are necessary. Or maybe it will take other kinds of policy interventions that are outside the scope of copyright law. Either way, policymakers can’t take the easy way out.

## Courses

* <https://github.com/mlabonne/llm-course>
