Delivery of the largest "open science" multilingual language model ever produced

While they regularly deliver fascinating results, large-scale artificial intelligence models are generally black boxes: we don't know exactly how they calculate their answers, and many elements are not made public.

12 July 2022

    The BigScience project, which brought together a thousand researchers in a participatory, open-science effort, is changing the game with "Bloom", the largest multilingual language model ever trained in a completely open and transparent way. This type of artificial intelligence simultaneously learns a text-generation model and a text-representation model by repeatedly performing an elementary task: predicting the next word of a text whose beginning is known, much as "intelligent" keyboards do. Beyond handling 46 languages, from English to Basque, its open-science nature will help scientists from all backgrounds explore the inner workings of language models in order to improve them. The BigScience project, initiated by the company Hugging Face, was supported by the CNRS, GENCI and the French Ministry of Higher Education and Research, enabling Bloom to be trained on the "Jean Zay" machine, one of Europe's most powerful supercomputers.

    Language models are artificial intelligence systems whose first applications deal with natural-language text: question answering, automatic sentence generation, sentiment detection, automatic summarization and simplification, and even machine translation.

    Generally designed by the technology giants, most existing models have been trained only on English-language texts, following principles and methods that are difficult to reproduce in full detail. It is impossible, for example, to know whether a model's answer to a question is the result of a genuine computation or was simply already present in its training data.

    The BigScience project was initiated in spring 2021 by the French-American artificial intelligence start-up Hugging Face to remedy these problems by training a new model: Bloom. It learns from large text corpora on a simple principle: completing sentences, word by word, by prediction. Each prediction made by the model is compared with the correct word, allowing the model's internal parameters to be adjusted. In Bloom's case, this was done by evaluating thousands of billions of words, yielding a model with 176 billion parameters. The learning process lasted several months and required hundreds of graphics processing units (GPUs) running in parallel, the equivalent of 5 million hours of computation. Such computing power can only be found on supercomputers like the Jean Zay machine.
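
    The training principle described above can be illustrated with a deliberately tiny sketch. This toy uses simple bigram counts as its "parameters" rather than Bloom's actual Transformer architecture and gradient descent, but the loop is the same idea: slide over known text, and let each observed next word adjust the model, which can then predict the most likely continuation.

```python
from collections import defaultdict

# Illustrative toy, not Bloom's real architecture: the "parameters"
# are bigram counts, updated from each correct next word in the text.
corpus = "the cat sat on the mat the cat ate the fish".split()

# For each word, a table of scores for the word that follows it.
params = defaultdict(lambda: defaultdict(int))

# "Training": every (word, next word) pair observed in the corpus
# adjusts the table, playing the role of a parameter update.
for prev, nxt in zip(corpus, corpus[1:]):
    params[prev][nxt] += 1

def predict_next(word):
    """Predict the most likely next word after `word`."""
    scores = params[word]
    return max(scores, key=scores.get) if scores else None

print(predict_next("the"))  # "cat": the most frequent successor of "the"
```

    A real language model replaces the count table with billions of continuous parameters adjusted by gradient descent, which is what makes the hundreds of parallel GPUs mentioned above necessary.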

    Bloom stands out from other language models in that it was trained in 46 languages simultaneously, drawn from sources as varied as literature, scientific articles and sports dispatches, and including many languages rarely taken into account, notably some twenty African languages. The training corpus even includes computer code! Altogether it amounts to several million books. The more diverse the approach and the sources, the more tasks the model is able to perform. What's more, the data was not sorted by language because, paradoxically, Bloom learns better that way: aggregating content in a variety of languages yields robust, high-performance models for all the languages considered, and often even better results than monolingual models. Another distinctive feature is that Bloom's architecture, the list of data used and its training log will be made entirely available as open science, to facilitate research on language models. Finally, Bloom is freely distributed under a responsible license that explicitly prohibits malicious uses of the model.

    "The creation of the Bloom model and the success of the BigScience research collaboration show that another way of creating, studying and sharing innovations in AI is possible, bringing together industrialists, academics and associations around an international, multidisciplinary and open access project. I'm delighted that Hugging Face has been able to find the necessary support in France for this unprecedented approach on a global scale", says Thomas Wolf, co-founder and scientific director of start-up Hugging Face.

    "BigScience achieves a world first and paves the way for other scientific breakthroughs. It has benefited from the resources of the Jean Zay converged supercomputer, one of the most powerful in Europe, commissioned in 2019 in the wake of the AI for Humanity plan. Today, more than 1,000 research projects draw on its resources. A decisive factor in this success, the Jean Zay extension deployed at the beginning of the year is the result of joint work between the Ministry of Higher Education and Research, the CNRS through the Institut du développement et des ressources en informatique scientifique (Idris), and GENCI", says Philippe Lavocat, President and CEO of GENCI.

    "We're delighted with this original public-private partnership, which shows just how essential complementary skills and resources, such as the power of the Jean Zay supercomputer, are for tackling a challenge as important and topical as artificial intelligence research. Behind the scientific breakthrough, we salute the involvement of the Idris staff who made this training possible on the supercomputer, and we welcome the essential role played by the CNRS through the mobilization of the entire natural language processing community", adds Antoine Petit, President and CEO of the CNRS.

    "I am delighted that this international project on one of the current technological frontiers of AI has been supported by the National Strategy for AI, and that the Bloom model will soon be accessible in an open framework. This will enable all innovative players to develop new use cases and applications", emphasizes Jean-Noël Barrot, Minister Delegate for the Digital Economy and Telecommunications.

    "The BigScience consortium reflects a public-private collaboration on a global scale, with more than a thousand contributors. Even though these models still require a great deal of scientific investigation, and their energy impact needs to be thoroughly assessed before any large-scale deployment, I'm proud that the French AI ecosystem is hosting such a world-class project", says Sylvie Retailleau, Minister of Higher Education and Research.

