AION-1: A multimodal foundation model for astronomy

Faced with the explosion of astronomical data (hundreds of terabytes, hundreds of millions of objects), AION-1 unifies, for the first time, the analysis of heterogeneous observations in a single model of over 3 billion parameters. This foundation model, trained on Jean Zay, allows astrophysicists to interpret data from different instruments and can be easily adapted to a multitude of scientific tasks.

3 November 2025

Modern astronomy has entered an era of data abundance thanks to a series of ground-based and space-based observatories that systematically survey the sky. While artificial intelligence (AI) has become essential for analyzing this profusion of data, current methods remain siloed, with each model specialized for a specific data type and task, hindering the systematic deployment of AI at scale.

Our project breaks through this bottleneck with AION-1, a foundation model capable of ingesting different types of astronomical data and easily adaptable to new tasks. This model thus provides observational astrophysics with a unique, multimodal and reusable foundation that treats instrumental diversity as an asset rather than an obstacle. 

Unifying Astronomical Data for Better Analysis 

AION-1's training corpus brings together 120 terabytes of data covering over 200 million astronomical objects, from nearby stars to distant quasars. These data come from five major surveys: Legacy Survey (122 million images), Hyper Suprime-Cam (2.5 million images), SDSS (8 million spectra), DESI (1 million high-resolution spectra), and the Gaia space telescope (77 million low-resolution spectra). This instrumental and astrophysical diversity serves a broad scientific community but presents a significant challenge: how can these heterogeneous data (2D images, 1D spectra, scalar measurements) be integrated into a single neural architecture?

Our solution relies on universal "tokenization": each modality passes through a specialized neural encoder that transforms it into a one-dimensional sequence of discrete tokens, analogous to the lexical units of language models. Multichannel images become sequences of visual tokens, spectra are compressed into spectral tokens, and scalars are quantized into numerical tokens. Each token retains its provenance annotation (modality × instrument), thus preserving crucial contextual information.
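As an illustration, the simplest case (quantizing a scalar measurement into a discrete token that keeps its modality × instrument provenance) could be sketched as follows. The bin edges, tag format, and function names are our own assumptions for illustration, not AION-1's actual tokenizer:

```python
from bisect import bisect_right

def quantize_scalar(value, bin_edges):
    """Map a continuous scalar onto a discrete token id via binning.

    bisect_right returns the index of the first edge greater than value,
    i.e. the bin that value falls into (ids 0 .. len(bin_edges))."""
    return bisect_right(bin_edges, value)

def tokenize_scalar(value, modality, instrument, bin_edges):
    """Return a (provenance, token_id) pair, preserving modality x instrument context."""
    return (f"{modality}:{instrument}", quantize_scalar(value, bin_edges))

# Hypothetical example: a redshift measurement binned into 4 intervals.
edges = [0.1, 0.5, 1.0]                       # 3 edges -> 4 bins (token ids 0..3)
print(tokenize_scalar(0.7, "scalar", "SDSS", edges))   # ('scalar:SDSS', 2)
```

Real image and spectrum encoders are learned neural networks rather than fixed bins, but the output contract is the same: a sequence of discrete, provenance-tagged token ids.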

Distilling Scientific Data into a Transformer Model 

The obtained tokens feed a Transformer encoder-decoder, an architecture that revolutionized natural language processing. This "any-to-any" architecture accepts any combination of multimodal tokens as input and can generate any modality as output. This allows the model to learn complex relationships between images, spectra, and scalar measurements. 

Training uses a multimodal masked learning strategy: for each astronomical object, we randomly mask a fraction of available tokens and ask the model to predict them from the remaining context. This approach forces cross-modal semantic integration and allows simultaneous learning of all possible conditional distributions. Once trained, AION-1 can thus predict a galaxy's spectrum from its image, or vice versa. Beyond these generative capabilities, the model produces high-quality latent representations (embeddings), directly exploitable for numerous scientific applications.
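The masking step itself can be sketched in a few lines. This is a generic minimal illustration of masked modeling over a mixed token sequence (the mask symbol, fraction, and token labels are assumptions, not AION-1's implementation); the model's training objective is then to recover the hidden targets from the visible context:

```python
import random

MASK = "<MASK>"

def mask_tokens(tokens, mask_fraction, rng):
    """Randomly hide a fraction of tokens.

    Returns (masked sequence, dict position -> original token), so a model
    can be trained to predict the hidden tokens from the remaining context."""
    n_mask = max(1, int(len(tokens) * mask_fraction))
    positions = rng.sample(range(len(tokens)), n_mask)
    masked = list(tokens)
    targets = {}
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return masked, targets

# A toy provenance-tagged sequence mixing image, spectrum, and scalar tokens.
seq = ["img:12", "img:87", "spec:3", "spec:41", "scalar:2"]
masked, targets = mask_tokens(seq, 0.4, random.Random(0))
```

Because the masked positions can fall in any modality, the model learns every conditional distribution at once: spectrum given image, image given spectrum, scalars given both, and so on.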

Computational Challenges of Scaling Up 

To study the scaling laws of this type of multimodal model (still poorly explored compared to language models), we developed a family of models of increasing sizes: AION-1-B (300M parameters, 64 GPUs), AION-1-L (800M, 100 GPUs), AION-1-XL (3.1 billion, 288 GPUs), up to the experimental AION-1-XXL (11 billion, 512 GPUs). This scaling required particular attention to parallelization strategy to maintain computational efficiency.

Stabilizing multimodal learning requires batches of 8192 samples, imposing massive parallelization via PyTorch FSDP with the ZeRO-2 strategy. Beyond plain data parallelism, this approach shards gradients and optimizer states across all GPUs, enabling training of models far exceeding single-accelerator memory. Jean Zay's high-performance interconnect proved crucial for maintaining efficiency despite intensive inter-GPU communications. Training AION-1-XL on 288 GPUs consumed approximately 50,000 compute hours.
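A back-of-envelope memory calculation shows why sharding matters. Assuming fp32 training with Adam (4 bytes per parameter, 4 per gradient, 8 for the two optimizer moments; these byte counts are illustrative assumptions, as real mixed-precision setups differ), ZeRO-2 keeps parameters replicated but divides gradients and optimizer states across GPUs:

```python
def zero2_memory_gb(n_params, n_gpus, bytes_param=4, bytes_grad=4, bytes_opt=8):
    """Rough per-GPU memory (GB) for model states under ZeRO-2:
    parameters stay replicated; gradients and Adam moments are sharded."""
    replicated = n_params * bytes_param                      # full copy on every GPU
    sharded = n_params * (bytes_grad + bytes_opt) / n_gpus   # split across GPUs
    return (replicated + sharded) / 1e9

# A 3.1e9-parameter model: unsharded vs sharded across 288 GPUs.
print(round(zero2_memory_gb(3.1e9, 1), 1))     # 49.6 GB of model states per GPU
print(round(zero2_memory_gb(3.1e9, 288), 1))   # 12.5 GB per GPU
```

This excludes activations and communication buffers, but it captures the key point: most of the savings come from not replicating the optimizer states, at the cost of extra inter-GPU traffic, which is why the interconnect becomes the bottleneck.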

Performance and Scientific Validation 

Applying a foundation model to scientific tasks requires rigorous calibration, as assumptions encoded during training may not apply to the specific problem. We therefore systematically use AION-1 after adaptation on a representative sample of the target task. 

AION-1's embeddings achieve remarkable performance with a simple linear classifier. For estimating galactic properties (stellar mass, age, metallicity, star formation rate), the model surpasses specialized approaches. In galaxy morphological classification, AION-1 significantly outperforms generalist vision models like Meta's DINOv2. More generally, whether for semantic segmentation of galactic structures or stellar parameter estimation, performance equals or exceeds specialized models. 
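To make the "simple classifier on frozen embeddings" idea concrete, here is a minimal stand-in: a nearest-centroid classifier (whose pairwise decision boundaries are linear) over toy 2-D vectors. The class names, vectors, and function names are invented for illustration; real probes operate on AION-1's high-dimensional embeddings:

```python
import math
from collections import defaultdict

def fit_centroids(embeddings, labels):
    """Average the embedding vectors per class."""
    sums, counts = defaultdict(lambda: None), defaultdict(int)
    for vec, lab in zip(embeddings, labels):
        sums[lab] = list(vec) if sums[lab] is None else [a + b for a, b in zip(sums[lab], vec)]
        counts[lab] += 1
    return {lab: [x / counts[lab] for x in s] for lab, s in sums.items()}

def predict(centroids, vec):
    """Assign the class whose centroid is closest in Euclidean distance."""
    return min(centroids, key=lambda lab: math.dist(centroids[lab], vec))

# Toy 2-D "embeddings" for two hypothetical morphology classes.
X = [[0.9, 0.1], [1.1, -0.1], [-1.0, 0.2], [-0.8, 0.0]]
y = ["spiral", "spiral", "elliptical", "elliptical"]
cents = fit_centroids(X, y)
print(predict(cents, [0.7, 0.0]))   # spiral
```

The point is that all the heavy lifting lives in the frozen embedding; the downstream head can stay this simple and still be competitive.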

The major advantage remains efficiency in the low-data regime: AION-1 adapts to new tasks with only 100 to 1000 examples, where traditional approaches require tens of thousands. This property is crucial for astronomy where expert annotation remains expensive and limited. 

Outlook 

AION-1 is to date the largest AI model ever trained in astrophysics, and more broadly in physics. Beyond the technical achievement, it demonstrates the feasibility of a unified approach for analyzing heterogeneous scientific data. This approach paves the way for future large surveys like the Vera Rubin Observatory, which will generate 20 TB of data per night, or the Euclid mission, which will map one billion galaxies. The computational investment made on Jean Zay thus positions France at the forefront of modern astronomy's digital transformation.

Key figures:

Trained on 120 terabytes of data, 3 billion parameters, 200 million celestial objects.

Definitions:

  • Foundation model: "A pre-trained AI model on massive datasets, general-purpose and adaptable to multiple tasks without complete retraining, like ChatGPT."
  • Tokenization: "A process that transforms complex data (images, spectra) into sequences of discrete symbols understandable by AI."
  • Transformer: "An AI architecture that uses attention mechanisms to process sequences of data and model their relationships."
  • Embeddings: "Vector representations of complex data that capture their main features."

 


Scientific domain

  • CT10: Artificial intelligence and cross-cutting applications of computing

Team

  • François Lanusse, Université Paris-Saclay, Université Paris Cité, CEA, CNRS, AIM
  • Liam Parker, University of California, Berkeley
  • Hatim Bourfoune, IDRIS, CNRS
  • Micah Bowles, University of Oxford
  • Nathan Cassereau, IDRIS, CNRS
  • Pierre Cornette, IDRIS, CNRS
  • Miles Cranmer, University of Cambridge
  • Tom Hehir, University of Cambridge
  • Shirley Ho, Flatiron Institute, New York University
  • Geraud Krawezik, Flatiron Institute
  • Ollie Liu, University of Southern California
  • Lucas Meyer, Polymathic AI
  • Jeff Shen, Princeton University
  • Sebastian Wagner-Carena, New York University

Organisation(s)

Université Paris-Saclay
Université Paris Cité
CEA
CNRS
AIM
IDRIS
University of California, Berkeley
University of Oxford
University of Cambridge
Flatiron Institute
New York University
University of Southern California
Polymathic AI
Princeton University

Resources used

200 kh (200,000 compute hours)

Allocation year

  • 2024
