Pleias 1.0: Open Language Models for the Public Good

Pleias 1.0 is the first family of language models trained exclusively on open data under a license that permits redistribution. Using the Jean Zay supercomputer, this project demonstrates that it is possible to reconcile performance, energy efficiency, and open science.

10 October 2025

The development of large language models (LLMs) today relies on massive volumes of data collected without oversight, often protected by copyright and difficult to audit. This opacity limits their adoption in regulated sectors (healthcare, finance, public services), where traceability and legal compliance are essential.

In this context, the Pleias 1.0 project offers an alternative: a family of European language models trained exclusively on open, traceable, and reusable data. The project builds on Common Corpus, the largest public database ever assembled (approximately 2 trillion tokens), and the computational resources of the Jean Zay supercomputer through the Grand Challenge program.

AN UNPRECEDENTED APPROACH

Pleias 1.0 combines several structural innovations: 

  • A massive corpus that is entirely public and documented, with reuse metadata at the document level.
  • Distributed training on Jean Zay using Hugging Face's Nanotron library, adapted and enhanced for this project.
  • Models ranging from 350M to 1.2B parameters, optimized for targeted uses (particularly retrieval-augmented generation, or RAG).
  • A balanced multilingual tokenizer ensuring better coverage of European languages than existing alternatives (a comparison sketch follows this list).
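
To make the tokenizer claim concrete, here is a minimal sketch of how such coverage can be measured: it compares token-per-word "fertility" across languages between an English-centric baseline (GPT-2) and a Pleias tokenizer. The Pleias repository id below is an assumption made for illustration; check the Hugging Face hub for the released name.

    # A minimal sketch, assuming the Pleias tokenizer is published on the
    # Hugging Face hub; the repository id below is hypothetical.
    from transformers import AutoTokenizer

    SAMPLES = {
        "en": "The supercomputer trains open language models for public research.",
        "fr": "Le supercalculateur entraîne des modèles de langue ouverts pour la recherche publique.",
        "de": "Der Supercomputer trainiert offene Sprachmodelle für die öffentliche Forschung.",
    }

    def fertility(tokenizer, text):
        # Average number of tokens produced per whitespace-separated word:
        # lower and flatter across languages means more balanced coverage.
        return len(tokenizer.encode(text)) / len(text.split())

    baseline = AutoTokenizer.from_pretrained("gpt2")
    pleias = AutoTokenizer.from_pretrained("PleIAs/Pleias-1.2b-Preview")  # hypothetical id

    for lang, text in SAMPLES.items():
        print(f"{lang}: gpt2={fertility(baseline, text):.2f} pleias={fertility(pleias, text):.2f}")

An English-centric tokenizer typically shows markedly higher fertility on French or German text than on English; a balanced tokenizer narrows that gap.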

This approach makes the entire pipeline—data, training, models—compatible with EU AI Act requirements for transparency, documentation, and governance.
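
The document-level metadata is what makes this auditability concrete. As an illustrative sketch (the dataset id and record schema are assumptions; the published dataset card is authoritative), one can stream a few Common Corpus records and inspect their reuse metadata directly:

    # A minimal sketch, assuming the corpus is published as "PleIAs/common_corpus"
    # and that non-text fields carry the reuse metadata; field names may differ,
    # so consult the dataset card for the actual schema.
    from datasets import load_dataset

    stream = load_dataset("PleIAs/common_corpus", split="train", streaming=True)

    for i, doc in enumerate(stream):
        if i >= 5:
            break
        # Print every field except the document body to surface the metadata.
        print({k: v for k, v in doc.items() if k != "text"})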

THREE VALIDATED CONTRIBUTIONS

  1. A truly open and traceable corpus. The Common Corpus represents a major breakthrough: no private or copyrighted data, but literary, scientific, and administrative texts as well as open-source code. Each document comes with metadata describing its reuse conditions, guaranteeing unprecedented legal security for LLM training.
  2. Efficient and reproducible models. Thanks to the efficiency of distributed computing on Jean Zay, we demonstrated that it is possible to train performant models at reduced cost. The carbon footprint of a 1.2B model is only 4 tons of CO₂ equivalent, several dozen times less than current industry standards.
  3. Pioneering use of synthetic data for RAG. Complementing the open corpus, we generated several million synthetic examples, also fully compliant with European copyright law, to anticipate future uses. These scenarios include adversarial dialogues, language mixing, and source-guided reasoning. They constitute an experimental synthetic playground enabling small models and agents to achieve robustness comparable to much larger LLMs, particularly in RAG tasks where source citation and response verification are essential (a usage sketch follows this list).
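
As a usage sketch for these RAG-oriented models, the snippet below formats numbered sources into a prompt and asks the model for a cited answer. The model id and the plain prompt format are assumptions made for illustration; the released models may expect a specific citation template documented on their model cards.

    # A minimal RAG sketch; model id and prompt format are assumptions.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "PleIAs/Pleias-RAG-350M"  # hypothetical id, check the hub
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    sources = [
        "[1] Jean Zay is a French supercomputer operated by IDRIS.",
        "[2] Common Corpus gathers roughly 2 trillion tokens of open data.",
    ]
    question = "What was Pleias 1.0 trained on, and on which machine?"
    prompt = "\n".join(sources) + f"\nQuestion: {question}\nAnswer with citations:"

    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens, not the prompt.
    print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))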

IMPACT

  1. Common Corpus: An Essential Dataset for the European Open Ecosystem

As early as August 2024, the first version of Common Corpus was highlighted as a "massive" part of the pre-training data commons for European languages, notably contributing to making French a high-resource language (Ali, 2024). At the time of writing, Common Corpus ranks among the top 3 most downloaded datasets on Hugging Face, with over 700k downloads since its creation.

In February 2025, Common Corpus was recognized as one of the main deliverables of the Paris AI Summit as the "largest open database for training large language models."

Common Corpus has already been integrated into a wide range of LLM projects. By September 2025, at least seven European SLMs and LLMs (Salamandra, Apertus, Nvidia's Neko...) had been trained on at least part of Common Corpus, with more to follow throughout 2025. Common Corpus is also used to create new datasets. For example, YouTube Commons was used to build two important multimodal datasets: FineVideo (43,000 YouTube videos) and Mosel (1 million hours of audio, half from YouTube Commons) (Gaido et al., 2024); Mosel was in turn used to train the FAMA series of foundational speech recognition models (Papi et al., 2025).

  2. Pleias 1.0: Frugal Models for Generative AI for the Public Good

Pleias-RAG-350m and Pleias-RAG-1B outperform SLMs under 4 billion parameters on standardized RAG benchmarks (HotPotQA, 2wiki) and are competitive with popular larger models, including Qwen2.5-7B, Llama-3.1-8B, and Gemma-3-4B.

These are the only SLMs to date that maintain consistent RAG performance across the major European languages and systematically ground their statements in cited references.

Thanks to their small size, their ease of deployment on constrained infrastructure, and their greater factuality by design, these models open the way to a whole range of new use cases for generative AI.

By September 2025, these models are used in production for a wide range of use cases, particularly those related to AI for public good. Indeed, by primarily leveraging the models' reasoning capabilities while keeping them extremely small, we enable efficient deployment in contexts with weak or nonexistent IT infrastructure. For example, Pleias-RAG models are currently deployed as legal assistants on Raspberry Pi 4 (8GB RAM) to serve field experts working with victims of sexual violence in the DRC and Ukraine.
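
The source does not describe the field deployment stack, but a plausible minimal setup for such constrained hardware is CPU inference on a quantized model, sketched below with llama-cpp-python; the GGUF file name is a placeholder.

    # A minimal on-device inference sketch, assuming a quantized GGUF export
    # of a small Pleias-RAG model; the file path is a placeholder.
    from llama_cpp import Llama

    llm = Llama(
        model_path="pleias-rag-350m.Q4_K_M.gguf",  # hypothetical quantized export
        n_ctx=2048,   # modest context window to bound memory use
        n_threads=4,  # the Raspberry Pi 4 has four cores
    )

    result = llm("[1] ...\nQuestion: ...\nAnswer with citations:", max_tokens=128)
    print(result["choices"][0]["text"])

At 350M to 1.2B parameters, a 4-bit quantized model fits comfortably within the 8 GB of RAM available on such a device.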

PERSPECTIVES AND APPLICATIONS

Pleias 1.0 opens new perspectives for:

  • Regulated sectors (healthcare, finance, public administrations), where AI Act compatibility and documentary traceability constitute decisive assets.
  • Sovereign environments, enabling model deployment on local or national infrastructure without dependence on foreign actors.
  • Scientific research, which benefits from a traceable, reproducible, and freely accessible corpus, facilitating comparisons between approaches.

CONCLUSION

By combining data openness, methodological innovation, and energy efficiency, Pleias 1.0 demonstrates that an alternative path for generative AI is possible. It meets the expectations of regulated sectors and European public services while laying the groundwork for a sovereign ecosystem aligned with the general interest and regulatory constraints.

Key Figure:

2 trillion tokens: the size of Common Corpus, the first and largest reproducible LLM training database worldwide, developed on Jean Zay.

Definition 1:

Synthetic data: text generated by other language models to supplement existing corpora and model tasks.

Definition 2:

Tokenizer: a tool that breaks text into elementary units ("tokens") to enable model learning.
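
For a concrete illustration of this definition, the snippet below tokenizes a short sentence with a widely available tokenizer (GPT-2 here, rather than the Pleias tokenizer):

    # A quick tokenization illustration with a standard tokenizer.
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    print(tok.tokenize("Open language models for the public good."))
    # Typically: ['Open', 'Ġlanguage', 'Ġmodels', ...] where 'Ġ' marks a word boundary.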

Scientific domain

  • CT10: Artificial intelligence and cross-cutting applications of computing

Team

  • Pierre-Carl Langlais, Pleias, Medialab/SCAI
  • Ivan Yamschikov, Pleias, Technische Hochschule Würzburg-Schweinfurt
  • Anastasia Stasenko, Pleias, Sorbonne-Nouvelle

Organization(s)

Pleias
Technische Hochschule Würzburg-Schweinfurt
Medialab/SCAI
Sorbonne-Nouvelle

Allocation year

  • 2024
