3 billion expert transitions across 11 million unique physics-based tasks
We release a dataset of per-level expert trajectories in Kinetix for over 11M unique tasks, totalling around 3B transitions.
Training generalist agents typically requires massive, diverse datasets, but most offline RL datasets cover at most tens of thousands of tasks.
The dataset comes in five splits, varying by environment size (small/medium/large) and expert training budget (1M or 10M steps).
Unique Levels is the number of distinct levels for which trajectories were collected; Transitions is the total number of individual environment steps across all trajectories.
| Expert Training Steps | Size | Unique Levels | Transitions | Size on Disk |
|---|---|---|---|---|
| 1M | s | 6M | 1.5B | 123 GB |
| 1M | m | 3.5M | 884M | 98 GB |
| 1M | l | 1M | 268M | 82 GB |
| 10M | s | 637k | 163M | 12 GB |
| 10M | m | 422k | 108M | 11 GB |
| Total | | 11M | 3B | 326 GB |
With 10M+ unique levels, this dataset makes it possible to study how offline agent performance scales with task diversity, and what challenges arise when training on millions of tasks.
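As a quick sanity check, the per-split figures in the table can be summed and compared against the Total row. The sketch below uses only the rounded numbers published above; the helper name `summarise` and the convention of 1 GB = 1e9 bytes are assumptions made for illustration.

```python
# Per-split figures copied from the table above (rounded as published).
# Each entry: (expert_steps, size, unique_levels, transitions, disk_gb)
SPLITS = [
    ("1M", "s", 6_000_000, 1_500_000_000, 123),
    ("1M", "m", 3_500_000, 884_000_000, 98),
    ("1M", "l", 1_000_000, 268_000_000, 82),
    ("10M", "s", 637_000, 163_000_000, 12),
    ("10M", "m", 422_000, 108_000_000, 11),
]

total_levels = sum(split[2] for split in SPLITS)
total_transitions = sum(split[3] for split in SPLITS)
total_disk_gb = sum(split[4] for split in SPLITS)


def summarise(split):
    """Derive per-level and per-transition averages for one split."""
    steps, size, levels, transitions, disk_gb = split
    return {
        "split": f"{steps}/{size}",
        "transitions_per_level": transitions / levels,
        # Assumption: 1 GB = 1e9 bytes; on-disk size may also reflect compression.
        "bytes_per_transition": disk_gb * 1e9 / transitions,
    }


if __name__ == "__main__":
    print(total_levels, total_transitions, total_disk_gb)
    for split in SPLITS:
        print(summarise(split))
```

With these rounded inputs the sums come to about 11.6M levels, 2.9B transitions and 326 GB, consistent with the Total row, and every split averages roughly 250 to 270 transitions per level.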
Dynamic Rendering

We store raw environment state rather than pre-rendered frames, so the full 3B-transition dataset fits in 326 GB. The rendering function is specified at runtime, meaning the same data can train symbolic or pixel-based agents simply by swapping the renderer.
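The idea of choosing the renderer at load time can be illustrated with a toy example. Everything in this sketch is hypothetical: `ToyState`, `render_symbolic`, `render_pixels` and `observations` are stand-ins invented for illustration and are not the Kinetix state or rendering API.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class ToyState:
    """Hypothetical stand-in for a stored env state: one ball in a unit box."""
    x: float
    y: float
    vx: float
    vy: float


def render_symbolic(state: ToyState) -> List[float]:
    """Symbolic observation: the raw state variables as a flat vector."""
    return [state.x, state.y, state.vx, state.vy]


def render_pixels(state: ToyState, size: int = 8) -> List[List[int]]:
    """Pixel observation: rasterise the ball into a size x size binary image."""
    col = min(size - 1, max(0, int(state.x * size)))
    row = min(size - 1, max(0, int(state.y * size)))
    image = [[0] * size for _ in range(size)]
    image[row][col] = 1
    return image


def observations(states: Sequence[ToyState], render_fn: Callable) -> list:
    """Apply the renderer chosen at load time to the stored states."""
    return [render_fn(state) for state in states]


# The same stored states feed both observation modalities.
stored = [ToyState(0.1, 0.2, 1.0, 0.0), ToyState(0.9, 0.5, -1.0, 0.0)]
symbolic_obs = observations(stored, render_symbolic)
pixel_obs = observations(stored, render_pixels)
```

Because only the compact state is stored, adding a new observation modality in this scheme means writing a new render function rather than regenerating the data.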
Easy Evaluations and White-Box Access

Since we store the raw environment state, we can evaluate agents online from any part of a trajectory, and evaluate how the trained agent performs on (a) the training levels; (b) unseen levels sampled from the same distribution; and (c) the hand-designed set of levels.
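Evaluating from any part of a trajectory amounts to restoring the environment to a stored state and rolling the agent out from there. The sketch below shows this on a hypothetical one-dimensional toy task; `step`, `rollout_from` and the two policies are illustrative names and do not reflect the Kinetix environment interface.

```python
from typing import Callable, List, Tuple

# Hypothetical toy task: the state is (position, steps_left) and the level
# is solved once position reaches GOAL.
State = Tuple[int, int]
GOAL = 5


def step(state: State, action: int) -> Tuple[State, float, bool]:
    """Deterministic toy dynamics; action is -1, 0 or +1."""
    position, steps_left = state
    position += action
    steps_left -= 1
    solved = position >= GOAL
    done = solved or steps_left <= 0
    return (position, steps_left), float(solved), done


def rollout_from(state: State, policy: Callable[[State], int]) -> float:
    """Resume the environment from a stored state and return the episode return."""
    total, done = 0.0, False
    while not done:
        state, reward, done = step(state, policy(state))
        total += reward
    return total


def always_right(state: State) -> int:
    return 1


def stay_put(state: State) -> int:
    return 0


# States visited by a stored expert trajectory; each one is a valid start point.
expert_states: List[State] = [(0, 8), (1, 7), (2, 6), (3, 5)]
success_by_start = [rollout_from(s, always_right) for s in expert_states]
```

Comparing success rates across start points in this way gives a picture of where along a trajectory an agent tends to fail, which is not available when only rendered frames are stored.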
We've updated the main Kinetix repository to include
ready-to-use data loaders, with a full example contained in examples/example_data_loading.py.
The dataset is hosted on Hugging Face. Download the entire dataset (~326 GB) or a single split:
Replace 1M/m with any {policy_steps}/{size} combination from the table above.
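For readers who prefer Python over a shell command, a single split could also be fetched with `huggingface_hub.snapshot_download`. This is a sketch under stated assumptions, not the project's documented procedure: `REPO_ID` is a placeholder to be replaced with the repository id shown on the dataset's Hugging Face page, and the `{policy_steps}/{size}/` directory layout is inferred from the note above.

```python
from typing import List, Optional, Tuple

# Placeholder: substitute the repository id from the dataset's Hugging Face page.
REPO_ID = "<org>/<dataset-name>"

# The five splits listed in the table above.
VALID_SPLITS = {("1M", "s"), ("1M", "m"), ("1M", "l"), ("10M", "s"), ("10M", "m")}


def split_patterns(policy_steps: str, size: str) -> List[str]:
    """Build the file patterns for one split."""
    if (policy_steps, size) not in VALID_SPLITS:
        raise ValueError(f"unknown split: {policy_steps}/{size}")
    # Assumption: each split lives under "{policy_steps}/{size}/" in the repository.
    return [f"{policy_steps}/{size}/*"]


def download(
    local_dir: str,
    split: Optional[Tuple[str, str]] = None,
    repo_id: str = REPO_ID,
) -> str:
    """Download one split, or the whole dataset (~326 GB) when split is None."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    patterns = None if split is None else split_patterns(*split)
    return snapshot_download(
        repo_id=repo_id,
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=patterns,
    )
```

For example, `download("data/kinetix", split=("1M", "m"))` would request only the files matching `1M/m/*`, provided the layout assumption holds.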
TrajectoryDatasetManager loads complete episodes.
Each batch has shape (batch_size, T, *dims), with the full env_state included
so you can re-render observations on-the-fly in any observation modality.
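To make the `(batch_size, T, *dims)` layout concrete, the sketch below builds a synthetic batch with NumPy and shows two common reshapes: merging the time axis into the batch axis for per-transition training, and cutting episodes into fixed-length windows for sequence models. The field names `obs` and `action` and the helper functions are hypothetical; the sketch does not call `TrajectoryDatasetManager`, whose exact interface is shown in examples/example_data_loading.py.

```python
import numpy as np


def flatten_time(x: np.ndarray) -> np.ndarray:
    """Merge (batch_size, T, *dims) into (batch_size * T, *dims)."""
    return x.reshape((-1,) + x.shape[2:])


def time_windows(x: np.ndarray, length: int) -> np.ndarray:
    """Cut (batch_size, T, *dims) into non-overlapping windows of `length` steps.

    Returns an array of shape (batch_size, T // length, length, *dims);
    trailing steps that do not fill a whole window are dropped.
    """
    batch_size, horizon = x.shape[:2]
    num_windows = horizon // length
    trimmed = x[:, : num_windows * length]
    return trimmed.reshape((batch_size, num_windows, length) + x.shape[2:])


# Synthetic stand-in for one loaded batch (field names are hypothetical).
batch_size, T, obs_dim = 4, 16, 10
rng = np.random.default_rng(0)
batch = {
    "obs": rng.normal(size=(batch_size, T, obs_dim)).astype(np.float32),
    "action": rng.integers(0, 3, size=(batch_size, T)),
}

flat = {key: flatten_time(value) for key, value in batch.items()}
chunks = time_windows(batch["obs"], length=5)
```

Flattening suits feed-forward policies that treat each transition independently, while windowing keeps temporal order for recurrent or transformer policies.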
A full offline BC training script is provided in experiments/offline_bc.py. Run it as follows, or
modify as needed.
Behaviour cloning (BC) on this dataset is an effective way to bootstrap a strong generalist agent. After fewer than 24 GPU-hours of training, a BC agent already outperforms the from-scratch PPO baseline from our original paper. Although the BC agent is not as strong as more recent methods developed in Kinetix, BC remains a compute-efficient way to warm-start a policy for online fine-tuning.
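For reference, the core of behaviour cloning is a supervised loss on expert actions. The sketch below is a minimal NumPy version that assumes a discrete action space and a policy that outputs one logit per action; the actual objective and action parameterisation in experiments/offline_bc.py may differ.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def bc_loss(logits: np.ndarray, expert_actions: np.ndarray) -> float:
    """Mean negative log-likelihood of the expert actions.

    logits: (N, num_actions) policy outputs; expert_actions: (N,) integer labels.
    """
    log_probs = log_softmax(logits)
    picked = log_probs[np.arange(len(expert_actions)), expert_actions]
    return float(-picked.mean())


def bc_grad(logits: np.ndarray, expert_actions: np.ndarray) -> np.ndarray:
    """Gradient of bc_loss with respect to the logits: (softmax - one_hot) / N."""
    grad = np.exp(log_softmax(logits))
    grad[np.arange(len(expert_actions)), expert_actions] -= 1.0
    return grad / len(expert_actions)


# Toy check: with uniform logits the loss equals log(num_actions), and one
# gradient step on the logits moves probability mass towards the expert actions.
logits = np.zeros((5, 3))
actions = np.array([0, 1, 2, 0, 1])
loss_before = bc_loss(logits, actions)
loss_after = bc_loss(logits - 1.0 * bc_grad(logits, actions), actions)
```

In a full training loop the gradient would flow into the network parameters that produce the logits rather than into the logits themselves.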
We hope this dataset enables cost-effective research into large-scale offline learning and how task diversity influences downstream performance. The dataset is also general enough to support other uses, such as training world models and level generators. If you have questions or want to collaborate, please reach out!
This is based on the Distill Template and the ACCEL Blog. Compute for this work was provided by the Isambard-AI National AI Research Resource, under the project "FLAIR 2025 Moonshot Projects". Thank you to Alex Goldie and Jarek Liesen for useful discussions.