Virtual Lecture by Hinrich Schütze on July 6 about Scaling Large Language Models

We look forward to a virtual talk hosted by the JAII. Our presenter, Prof. Hinrich Schütze (LMU Munich, Schütze lab), is a renowned expert in computational linguistics and will speak about “Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages”. The talk will be held on July 6, 16:00–17:30 CET; this is the Zoom link to participate.

Large language models (LLMs) are currently the most active area of research in NLP. Most work has focused on what we call “vertical” scaling: making LLMs even better for a relatively small number of high-resource languages. We address “horizontal” scaling instead: extending LLMs to a large subset of the world’s languages, with a focus on low-resource languages. Our Glot500-m model is trained on 500 languages, many of which are not covered by any other language model. I will talk about the major challenges we faced in creating Glot500: (i) finding, validating, and cleaning training data for that many languages; (ii) evaluating the performance of Glot500-m on languages for which native speakers and labeled datasets were not available to us; and (iii) determining the factors that ultimately make training on a language successful. We find that trying to reduce such factors to the so-called curse of multilinguality is naive, and that there is in fact also a “boon of multilinguality”. We are in the process of making Glot500-c, our training corpus covering 500 languages, publicly available.