Modern astronomy has entered an era of data abundance, thanks to a series of ground-based and space-based observatories that systematically survey the sky. While artificial intelligence (AI) has become essential for analyzing this profusion of data, current methods remain siloed: each model is specialized for a single data type and task, which hinders the systematic deployment of AI at scale.
Our project addresses this bottleneck with AION-1, a foundation model capable of ingesting many different types of astronomical data and easily adaptable to new tasks. It provides observational astrophysics with a unified, multimodal, and reusable foundation that treats instrumental diversity as an asset rather than an obstacle.
Unifying Astronomical Data for Better Analysis
AION-1's training corpus brings together 120 terabytes of data covering over 200 million astronomical objects, from nearby stars to distant quasars. These data come from five major surveys: Legacy Survey (122 million images), Hyper Suprime-Cam (2.5 million images), SDSS (8 million spectra), DESI (1 million high-resolution spectra), and the Gaia space telescope (77 million low-resolution spectra). This instrumental and astrophysical diversity serves a broad scientific community but presents a significant challenge: how can such heterogeneous data (2D images, 1D spectra, scalar measurements) be integrated into a single neural architecture?
Our solution relies on universal "tokenization": each modality passes through a specialized neural encoder that transforms it into a one-dimensional sequence of discrete tokens, analogous to the lexical units of language models. Multichannel images become sequences of visual tokens, spectra are compressed into spectral tokens, and scalars are quantized into numerical tokens. Each token retains a provenance annotation (modality × instrument), preserving crucial contextual information.
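To make the scalar branch concrete, the sketch below quantizes a continuous measurement into one of a fixed number of bins. The bin count (1024) and the uniform binning scheme are illustrative assumptions for this sketch, not AION-1's actual tokenizer:

```python
import numpy as np

def quantize_scalar(value, vmin, vmax, n_bins=1024):
    """Map a continuous scalar (e.g. a redshift or a magnitude) to a
    discrete token id by uniform binning over [vmin, vmax]."""
    clipped = np.clip(value, vmin, vmax)
    frac = (clipped - vmin) / (vmax - vmin)
    return int(min(frac * n_bins, n_bins - 1))

def detokenize_scalar(token, vmin, vmax, n_bins=1024):
    """Invert the mapping to the bin centre (lossy, like any quantizer)."""
    return vmin + (token + 0.5) / n_bins * (vmax - vmin)
```

A value of 0.5 on a [0, 1] range lands in the middle bin, and detokenizing returns that bin's centre, so the round trip is lossy by at most half a bin width.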
Distilling Scientific Data into a Transformer Model
The resulting tokens feed a Transformer encoder-decoder, the architecture that revolutionized natural language processing. This "any-to-any" design accepts any combination of multimodal tokens as input and can generate any modality as output, allowing the model to learn complex relationships between images, spectra, and scalar measurements.
Training uses a multimodal masked learning strategy: for each astronomical object, we randomly mask a fraction of the available tokens and ask the model to predict them from the remaining context. This approach forces cross-modal semantic integration and allows simultaneous learning of all possible conditional distributions. Once trained, AION-1 can thus predict a galaxy's spectrum from its image, or vice versa. Beyond these generative capabilities, the model produces high-quality latent representations (embeddings), directly exploitable for numerous scientific applications.
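The masking step of this strategy can be sketched as follows; the 40% mask ratio and the sentinel token id are illustrative choices for the sketch, not the values used for AION-1:

```python
import numpy as np

def mask_tokens(tokens, mask_ratio=0.4, mask_id=-1, rng=None):
    """Multimodal masked modelling: hide a random fraction of the token
    sequence; the model must predict the hidden tokens from the rest.
    `tokens` is the concatenated sequence for one object (image tokens,
    spectrum tokens, scalar tokens)."""
    rng = rng or np.random.default_rng()
    tokens = np.asarray(tokens)
    n_mask = max(1, int(round(mask_ratio * len(tokens))))
    positions = rng.choice(len(tokens), size=n_mask, replace=False)
    corrupted = tokens.copy()
    corrupted[positions] = mask_id
    # Model input, masked positions, and the targets to reconstruct.
    return corrupted, positions, tokens[positions]
```

The training loss is then a cross-entropy computed only at the masked positions, so the model is never rewarded for copying visible tokens.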
Computational Challenges of Scaling Up
To study the scaling laws of this type of multimodal model (still poorly explored compared to language models), we developed a family of models of increasing size: AION-1-B (300 million parameters, 64 GPUs), AION-1-L (800 million, 100 GPUs), AION-1-XL (3.1 billion, 288 GPUs), up to the experimental AION-1-XXL (11 billion, 512 GPUs). This scaling required particular attention to the parallelization strategy to maintain computational efficiency.
Stabilizing multimodal learning requires batches of 8,192 samples, imposing massive parallelization via PyTorch FSDP with the ZeRO-2 strategy. In addition to data parallelism, this approach shards gradients and optimizer states across all GPUs, enabling the training of models far exceeding the memory of a single accelerator. Jean Zay's high-performance interconnect proved crucial for maintaining efficiency despite intensive inter-GPU communication. Training AION-1-XL on 288 GPUs consumed approximately 50,000 compute hours.
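A back-of-the-envelope sketch shows why this sharding matters. The byte counts below are common mixed-precision assumptions (fp16 parameters and gradients, Adam with fp32 master weights plus two fp32 moments, i.e. 12 bytes of optimizer state per parameter), not measured figures for AION-1:

```python
def zero2_memory_per_gpu(n_params, n_gpus,
                         bytes_param=2, bytes_grad=2, bytes_opt=12):
    """Approximate per-GPU memory (GiB) under ZeRO-2-style sharding:
    parameters are replicated on every GPU, while gradients and Adam
    optimizer states are partitioned across the data-parallel group."""
    params = n_params * bytes_param            # replicated everywhere
    grads = n_params * bytes_grad / n_gpus     # sharded
    opt = n_params * bytes_opt / n_gpus        # sharded
    return (params + grads + opt) / 2**30
```

For a 3.1-billion-parameter model such as AION-1-XL, the unsharded gradients and optimizer states alone would add roughly 40 GiB per GPU; spread over 288 GPUs they shrink to a fraction of a gibibyte, leaving mainly the replicated parameters.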
Performance and Scientific Validation
Applying a foundation model to scientific tasks requires rigorous calibration, as the assumptions encoded during training may not hold for the problem at hand. We therefore systematically adapt AION-1 on a representative sample of the target task before use.
AION-1's embeddings achieve remarkable performance with a simple linear classifier. For estimating galaxy properties (stellar mass, age, metallicity, star formation rate), the model surpasses specialized approaches. In galaxy morphological classification, AION-1 significantly outperforms generalist vision models such as Meta's DINOv2. More generally, whether for semantic segmentation of galactic structures or stellar parameter estimation, its performance equals or exceeds that of specialized models.
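The linear-probe protocol itself is simple enough to sketch on synthetic data. The vectors below are random stand-ins for frozen AION-1 embeddings, and the least-squares fit is one minimal choice of linear classifier, not the evaluation pipeline actually used:

```python
import numpy as np

# Synthetic stand-in for frozen foundation-model embeddings: two classes
# of 64-dim vectors whose means are slightly offset.
rng = np.random.default_rng(0)
n, dim = 400, 64
X = rng.normal(size=(n, dim))
y = rng.integers(0, 2, size=n)
X[y == 1] += 0.5  # separate the two classes

# Linear probe: a least-squares linear map from embeddings to labels,
# trained on 300 examples and evaluated on the held-out 100.
Xb = np.hstack([X, np.ones((n, 1))])  # append a bias column
w, *_ = np.linalg.lstsq(Xb[:300], y[:300].astype(float), rcond=None)
pred = (Xb[300:] @ w > 0.5).astype(int)
accuracy = (pred == y[300:]).mean()
```

The point of the protocol is that the encoder stays frozen: only the small linear map is fitted, so probe accuracy directly measures how much task-relevant structure the embeddings already contain.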
The major advantage remains efficiency in the low-data regime: AION-1 adapts to new tasks with only 100 to 1000 examples, where traditional approaches require tens of thousands. This property is crucial for astronomy where expert annotation remains expensive and limited.
Outlook
AION-1 is to date the largest AI model ever trained in astrophysics, and more broadly in physics. Beyond the technical achievement, it demonstrates the feasibility of a unified approach for analyzing heterogeneous scientific data. This approach paves the way for future large surveys like the Vera Rubin Observatory, which will generate 20 TB of data per night, or the Euclid mission, which will map one billion galaxies. The computational investment made on Jean Zay thus positions France at the forefront of modern astronomy's digital transformation.
Key figure:
Trained on 120 terabytes of data, 3 billion parameters, 200 million celestial objects.
Definitions:
- Foundation model: "An AI model pre-trained on massive datasets, general-purpose and adaptable to multiple tasks without complete retraining, like ChatGPT."
- Tokenization: "A process that transforms complex data (images, spectra) into sequences of discrete symbols understandable by AI."
- Transformer: "An AI architecture that uses attention mechanisms to process sequences of data and model their relationships."
- Embeddings: "Vector representations of complex data that capture their main features."