Autonomous driving remains a major technological and scientific challenge, at the crossroads of artificial intelligence, robotics, and road safety. For over a decade, two main approaches have dominated: on the one hand, reinforcement learning in simulation, where agents explore virtual environments like CARLA to learn navigation through reward signals; on the other hand, imitation learning from human experts, where models replicate the decisions of real drivers using large annotated datasets.
These strategies have driven considerable progress, but they face two major obstacles. First, simulators struggle to capture the full richness and diversity of the real world, making the transition from simulation to the road extremely difficult. Second, imitation-based approaches require costly annotated data: thousands of hours of driving enriched with HD maps, GPS trajectories, and object detections. Even then, rare but critical scenarios — a pedestrian darting out between cars, a vehicle driving the wrong way, or a sudden obstacle — remain underrepresented in datasets.
Our project proposes an alternative: learning to drive directly from raw YouTube videos. We leverage OpenDV, a corpus of more than 1,700 hours of dashcam footage uploaded by individuals worldwide. These videos come with no calibration, no annotations, and no complex preprocessing: they capture the road as it is, with traffic lights, weather changes, driver behaviors, and unexpected events. This approach enables scaling up by capitalizing on the abundance of data available online.
The intuition behind our approach is the following: if a video generative model can accurately predict the continuation of a sequence, it must necessarily have internalized the implicit rules of the world — the physics of moving objects, vehicle dynamics, traffic regularities, and human behavior. Such a model does not merely imitate: it learns a rich, implicit representation of the road. Building on this foundation, we propose an end-to-end driving system structured around two complementary modules.
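The "predict the continuation" objective can be illustrated with a deliberately tiny sketch. The real VaViM is an autoregressive model over video, but the same idea already appears in a toy setting: a synthetic "world" with hidden linear dynamics, and a predictor trained, with no labels, only to forecast the next frame. If the prediction error drops to the noise floor, the model has necessarily captured the dynamics. All names and dimensions below are illustrative, not from the actual system.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "world": frames evolve under hidden linear dynamics A, plus small noise.
D = 16                                    # flattened frame dimension (toy)
T = 200                                   # number of training frames
A = rng.normal(size=(D, D)) * 0.1         # hidden dynamics (unknown to the model)
video = [rng.normal(size=D)]
for _ in range(T):
    video.append(A @ video[-1] + 0.01 * rng.normal(size=D))
video = np.stack(video)                   # (T + 1, D)

# Generative pretraining objective: predict frame_{t+1} from frame_t (MSE loss).
X, Y = video[:-1], video[1:]
W = np.linalg.lstsq(X, Y, rcond=None)[0]  # least-squares next-frame predictor

mse = np.mean((X @ W - Y) ** 2)
print(f"next-frame MSE: {mse:.5f}")       # low error -> dynamics were captured
```

The learned predictor recovers the hidden dynamics from raw observations alone, which is the essence of the claim: accurate continuation prediction forces the model to internalize the rules that generate the data.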
The first, called VaViM, is a pretrained video generative model. It learns from millions of dashcam sequences to predict future frames, discovering in an unsupervised way the spatial and temporal structure of driving scenes. The second, VaVAM, is an action model that reuses these visual representations to generate realistic driving trajectories. By combining these two building blocks, we obtain a complete system that directly transforms video frames into driving decisions.
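The two-module structure can be sketched end to end. This is a minimal stand-in, not the real architecture: the actual VaViM is a large video generative model and VaVAM a learned action head, whereas here both are toy classes with made-up dimensions, used only to show how frames flow into features and features into a trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

class VideoModel:
    """Toy stand-in for VaViM: maps a clip of frames to per-frame features."""
    def __init__(self, feat_dim=8):
        self.proj = rng.normal(size=(3, feat_dim))  # toy linear "encoder"

    def encode(self, frames):
        # frames: (T, H, W, 3) -> (T, feat_dim) via mean-pool + projection
        return frames.mean(axis=(1, 2)) @ self.proj

class ActionModel:
    """Toy stand-in for VaVAM: turns visual features into future 2D waypoints."""
    def __init__(self, feat_dim=8, horizon=6):
        self.w = rng.normal(size=(feat_dim, horizon * 2))
        self.horizon = horizon

    def plan(self, feats):
        flat = feats[-1] @ self.w            # last feature = driving context
        return flat.reshape(self.horizon, 2)  # (x, y) offsets in the ego frame

frames = rng.random((4, 32, 32, 3))   # 4 dummy RGB frames
feats = VideoModel().encode(frames)   # (4, 8) visual representations
waypoints = ActionModel().plan(feats) # (6, 2) planned trajectory
print(waypoints.shape)
```

The key design point survives the simplification: the action model never sees pixels directly, only the representations produced by the pretrained video model.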
A central aspect of our research is scaling. We trained several versions of our models, ranging from a few hundred million to 1.2 billion parameters, and studied their scaling laws: empirical relationships between model size, data volume, and performance. Our results confirm a well-documented trend in language and vision: performance improves steadily with scale. Concretely, larger models trained on more video produce better video predictions, and these gains translate into stronger driving performance. This suggests that progress toward truly safe and robust systems will come not only from algorithmic improvements but also from leveraging massive volumes of raw data.
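In practice, fitting a scaling law means regressing log-loss against log-model-size and reading off the power-law exponent. The sketch below uses invented loss values (the only real quantity taken from the text is the 1.2B upper model size); it shows the standard recipe, not our measured numbers.

```python
import numpy as np

# Hypothetical loss measurements at several model sizes; illustrative only.
params = np.array([1e8, 3e8, 6e8, 1.2e9])   # model sizes (parameters)
loss   = np.array([3.10, 2.85, 2.70, 2.58])  # made-up validation losses

# Fit L(N) = a * N^(-b) by linear regression in log-log space.
slope, log_a = np.polyfit(np.log(params), np.log(loss), 1)
a = np.exp(log_a)
exponent = -slope                             # positive scaling exponent b

print(f"L(N) ~ {a:.2f} * N^(-{exponent:.3f})")
# Extrapolate to a hypothetical 3B-parameter model.
print(f"predicted loss at 3B params: {a * (3e9) ** (-exponent):.2f}")
```

A straight line in log-log space is what "performance improves steadily with scale" looks like quantitatively, and the fitted exponent lets one extrapolate to model sizes not yet trained.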
To validate our approach, we used NeuroNCAP, a recent benchmark that simulates critical scenarios in closed loop via neural rendering. Unlike traditional simulators, NeuroNCAP recreates photorealistic scenes from real data and injects adversarial agents to test the vehicle's reactions. In this demanding setting, our models, trained exclusively on raw videos, achieve competitive performance on safety scenarios, particularly frontal collisions. More strikingly, they display an emergent ability to anticipate and avoid dangers through realistic maneuvers, without any explicit rules being coded.
This project shows that with freely available Internet videos, it is possible to develop generative models at scale capable of powering robust driving intelligence. This approach reduces reliance on costly annotations, provides geographically and culturally diverse data, and paves the way for more generalizable and safer driving systems. It also lays the groundwork for future research: integrating multiple cameras, leveraging pseudo-GPS labels to better align trajectories and videos, and exploring even larger architectures.
In short, our work demonstrates that learning to drive “by watching YouTube” is no pipe dream. It is a serious scientific avenue, where scale, raw data, and generative models together open new perspectives for the autonomous driving of tomorrow.
Key figure: 1,800 hours of video used to pretrain the models, with no manual annotation required.
Key definitions:
Generative pretraining: learning to predict the continuation of a video, without human labels, to capture world dynamics.
Scaling laws: empirical relation between model size, data volume, and performance.
Closed-loop evaluation: protocol where model decisions directly affect how the simulation evolves.
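The closed-loop definition is worth making concrete, because it is what distinguishes NeuroNCAP from replaying logged data. Below is a minimal toy rollout, not the benchmark itself: a 1D world where the ego car approaches a stopped obstacle, and a hypothetical `planner` (standing in for the driving model) feeds its decision back into the simulation at every step. The brake threshold and accelerations are invented for illustration.

```python
# Toy closed-loop rollout: the planner's action changes the next observation,
# unlike open-loop replay where observations come from a fixed log.

def planner(gap):
    """Hypothetical policy: brake hard when the gap is small, else cruise."""
    return -4.0 if gap < 30.0 else 1.0   # acceleration command (m/s^2)

ego_pos, ego_vel, obstacle = 0.0, 10.0, 60.0   # positions in m, speed in m/s
dt, collided = 0.5, False
for _ in range(40):                            # simulate 20 seconds
    accel = planner(obstacle - ego_pos)        # decision feeds back into world
    ego_vel = max(0.0, ego_vel + accel * dt)   # cars do not reverse here
    ego_pos += ego_vel * dt
    if ego_pos >= obstacle:
        collided = True
        break

print("collision:", collided)
```

In open-loop evaluation the same policy would only be scored on how closely it matches logged actions; in closed loop, a late braking decision actually produces a collision, which is exactly the failure mode safety benchmarks need to surface.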