For the OpenLLM-France community and LINAGORA, the aim of the Grand Challenge was to develop Lucie-7B, a multilingual foundation model, with a clear objective: to offer a truly open source alternative able to correct the Anglo-centric biases of today's major language models. The model was trained on the Jean Zay supercomputer from September to December 2024, with the support of the CNRS/IDRIS teams. Unlike most models, Lucie-7B devotes as much training data to French as to English (around 33% each), with a particular focus on European linguistic diversity (German, Spanish and Italian). Emphasis is also placed on transparency, openness and respect for the rights associated with the data used.
Our work is based on two complementary pillars: the Lucie-7B model and the Lucie Training Dataset. The latter aggregates over 2 billion documents, totalling 2.3 trillion tokens, from a variety of sources: the web, parliamentary corpora, books in the public domain, scientific articles, newspapers, oral dialogues, legal texts and forum discussions. 40% of the data comes from French sources, an unprecedented proportion for AI models of this scale. The entire creation process was guided by principles of ethics and transparency: only documents with open access or under an open license (CC-BY, public domain, etc.) were integrated into the dataset, with the exception of one sub-corpus that is described but not redistributed for legal reasons. Particular care was taken with pre-processing, including optical character recognition (OCR) on heritage archives, filtering by linguistic perplexity and extensive standardization of metadata. The filtering also includes a mechanism to honour the opt-out right attached to the text-and-data-mining copyright exception, in accordance with the European directive.
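The perplexity filtering mentioned above can be sketched as follows. This is a deliberately simplified illustration with a character-level unigram model and an arbitrary threshold; real pipelines of this kind typically score documents with trained n-gram language models, and all names and values here are hypothetical:

```python
import math
from collections import Counter

def unigram_log_probs(corpus):
    """Estimate character log-probabilities from a reference corpus."""
    counts = Counter(corpus)
    n = sum(counts.values())
    return {c: math.log(k / n) for c, k in counts.items()}

def perplexity(text, log_probs, floor=math.log(1e-6)):
    """Perplexity of `text` under a character-level unigram model.
    Characters unseen in the reference corpus fall back to a small
    probability floor."""
    total = sum(log_probs.get(c, floor) for c in text)
    return math.exp(-total / max(len(text), 1))

def keep_document(text, log_probs, max_ppl=100.0):
    """Keep documents whose perplexity stays below a threshold:
    very high perplexity usually signals boilerplate or garbled text."""
    return perplexity(text, log_probs) < max_ppl
```

The design intuition is that text resembling the reference corpus gets low perplexity, while encoding debris and boilerplate score high and are dropped.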
The Lucie-7B model itself was trained in three phases on a balanced base of French and English data, a smaller proportion of German, Spanish and Italian data, and some code. An adapted tokenizer was also developed. The main phase is pre-training, during which the model sees most of the data; the second phase extends the context window from 4,096 to 32,000 tokens; and the final annealing phase re-trains the model on high-quality data, mainly from the mathematical domain, while the learning rate is linearly reduced to zero. Two fine-tuned variants were then published: Lucie-7B-Instruct-v1.1 and Lucie-7B-Instruct-human-data, one trained on synthetic instructions, the other on human instruction data. These models demonstrate performance comparable to other recent open source LLMs.
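The linear learning-rate reduction of the final annealing phase can be sketched as follows. This is a minimal illustration only; the actual peak value, step counts and schedule come from the training configuration, and the function name is hypothetical:

```python
def annealed_lr(step, total_steps, peak_lr):
    """Learning rate for a linear annealing phase: decays from
    `peak_lr` down to zero over `total_steps` optimizer steps."""
    frac = min(step / total_steps, 1.0)  # fraction of annealing done
    return peak_lr * (1.0 - frac)
```

For example, with a peak rate of 3e-4 over 1,000 steps, the rate is 3e-4 at step 0, 1.5e-4 at step 500 and exactly zero at the end.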
Lucie-7B also stands out for its compliance with the new definition of open source AI proposed by the Open Source Initiative (OSI), covering the code (training and data-processing scripts available on GitHub under the AGPL license), the model weights (Apache 2.0 license) and the datasets (Creative Commons licenses), all downloadable from the Hugging Face platform. The whole package is accessible to the scientific community, companies and public institutions. A technical report accompanies the model and is distributed via arXiv.
Environmental aspects have not been forgotten. As the Jean Zay supercomputer is located in France, it is powered by low-carbon nuclear electricity. Furthermore, GENCI data for this installation indicates that the Power Usage Effectiveness (PUE) of the H100 partition is 1.21, and that its carbon footprint coefficient is 25 g CO2-eq per GPU-hour. Given that the training run consumed around 500,000 GPU-hours, this translates into a total carbon footprint of 12.5 tonnes of CO2-eq, compared with, say, 31.22 tonnes of CO2-eq for Llama 2 7B or 390 tonnes for Llama 3 8B.
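As a sanity check, the total footprint is simply the GPU-hour count multiplied by the per-GPU-hour carbon coefficient, converted from grams to tonnes (illustrative values, assuming roughly 500,000 GPU-hours at 25 g CO2-eq per GPU-hour):

```python
# Back-of-the-envelope carbon estimate for a training run.
gpu_hours = 500_000            # approximate GPU-hours consumed
g_co2_per_gpu_hour = 25        # assumed GENCI coefficient, H100 partition
tonnes_co2 = gpu_hours * g_co2_per_gpu_hour / 1_000_000  # grams -> tonnes
print(tonnes_co2)  # 12.5
```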
This project is part of a wider effort to re-appropriate artificial intelligence technologies in Europe, in response to the domination of American and Chinese models. It demonstrates that it is possible to reconcile open source, data ethics and linguistic sovereignty. Lucie, derived from the Latin word “lux” (light), embodies a desire for clarity, sharing and emancipation in the development of AI.
Key figure:
40% of Lucie-7B's training data comes from French sources.
Definitions:
Foundation model: a general-purpose AI model, pre-trained on large corpora, adaptable to various uses.
Truly open source model: open source code, unrestricted open licence, open training data.
Multilingualism: a model's ability to understand and generate multiple languages.