3 billion expert transitions across 11 million unique physics-based tasks

TL;DR

We release a dataset of per-level expert trajectories in Kinetix for over 11M unique tasks, totalling around 3B transitions.

Introduction

Training generalist agents typically requires massive, diverse datasets, but most offline RL datasets cover at most tens of thousands of tasks. A notable example is XLand-100B, which contains full learning trajectories from 30,000 tasks. Here we release a dataset of 3 billion expert transitions across 11 million unique physics-based tasks, enabling offline RL at a scale not previously available. All tasks come from Kinetix, a procedurally generated 2D physics environment where the goal is to make the green and blue objects touch without green touching red. As a first demonstration of what this enables, we show that behaviour cloning followed by PPO fine-tuning reaches strong performance at a fraction of the compute required to train PPO from scratch.

Dataset

The dataset comes in five splits, varying by environment size (small, medium, or large, abbreviated s, m, and l) and expert training budget (1M or 10M steps).

Dataset Collection Process

We train specialist RL agents per level for a fixed number of timesteps and then collect a single trajectory per level. Since not all environments are solvable, these experts do not always succeed in the task. For the released dataset, we select only the optimal trajectories from solvable tasks (about 50% of tasks are solvable).
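The selection step can be pictured as a simple filter over per-level rollouts. The sketch below is an illustration only, not the actual collection pipeline: the `Rollout` record and its `solved` flag are hypothetical names, and it assumes that "optimal" means the expert reached the goal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rollout:
    """Hypothetical record of one expert rollout on one level."""
    level_id: int
    num_steps: int
    solved: bool  # assumption: True if the expert reached the goal


def select_successful(rollouts):
    """Keep at most one successful rollout per level; drop failed levels."""
    kept = {}
    for rollout in rollouts:
        if rollout.solved and rollout.level_id not in kept:
            kept[rollout.level_id] = rollout
    return list(kept.values())


# Toy data: two of the four levels are solved by their specialist.
rollouts = [
    Rollout(level_id=0, num_steps=120, solved=True),
    Rollout(level_id=1, num_steps=256, solved=False),
    Rollout(level_id=2, num_steps=87, solved=True),
    Rollout(level_id=3, num_steps=256, solved=False),
]
dataset = select_successful(rollouts)
kept_fraction = len(dataset) / len(rollouts)
```

Filtering like this means the released data contains no failed attempts, so it suits imitation-style objectives more directly than methods that rely on seeing unsuccessful behaviour.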

Dataset Details

Here are the details of the different datasets we collected using this process.

Unique Levels is the number of distinct levels for which trajectories were collected; Transitions is the total number of individual environment steps across all trajectories.

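| Expert Training Steps | Size | Unique Levels | Transitions | Size on Disk |
|-----------------------|------|---------------|-------------|--------------|
| 1M                    | s    | 6M            | 1.5B        | 123 GB       |
| 1M                    | m    | 3.5M          | 884M        | 98 GB        |
| 1M                    | l    | 1M            | 268M        | 82 GB        |
| 10M                   | s    | 637k          | 163M        | 12 GB        |
| 10M                   | m    | 422k          | 108M        | 11 GB        |
| Total                 |      | 11M           | 3B          | 326 GB       |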
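Because each level contributes a single trajectory, dividing transitions by unique levels gives the mean trajectory length per split. The short calculation below uses only the rounded figures from the table above, so the results are approximate; the dictionary layout and function name are ours.

```python
# (expert training steps, size) -> (unique levels, transitions),
# copied from the table above (rounded values).
SPLITS = {
    ("1M", "s"): (6_000_000, 1_500_000_000),
    ("1M", "m"): (3_500_000, 884_000_000),
    ("1M", "l"): (1_000_000, 268_000_000),
    ("10M", "s"): (637_000, 163_000_000),
    ("10M", "m"): (422_000, 108_000_000),
}


def mean_episode_length(levels, transitions):
    """Mean steps per trajectory, given one trajectory per level."""
    return transitions / levels


total_levels = sum(levels for levels, _ in SPLITS.values())
total_transitions = sum(transitions for _, transitions in SPLITS.values())

for (steps, size), (levels, transitions) in SPLITS.items():
    length = mean_episode_length(levels, transitions)
    print(f"{steps}/{size}: ~{length:.0f} steps per trajectory")

print(f"total: {total_levels / 1e6:.2f}M levels, "
      f"{total_transitions / 1e9:.2f}B transitions")
```

With these figures every split averages roughly 250 to 270 steps per trajectory, and the per-split rows sum to the rounded totals of about 11M levels and 3B transitions.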

Why use this dataset?

Massive Task Diversity

With 10M+ unique levels, this dataset makes it possible to study how offline agent performance scales with task diversity, and what challenges arise when training on millions of tasks.

Dynamic Rendering

We store raw environment state rather than pre-rendered frames, so the full 3B-transition dataset fits in 326 GB. The rendering function is specified at runtime, meaning the same data can train symbolic or pixel-based agents simply by swapping the renderer.
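To give a rough sense of the saving, the back-of-the-envelope estimate below compares the on-disk footprint per transition with what pre-rendered frames would need. The 64x64 RGB `uint8` frame format is a hypothetical choice made purely for illustration, and the on-disk figure includes everything stored per step (and any compression), not just the simulator state.

```python
TOTAL_TRANSITIONS = 3_000_000_000  # ~3B transitions, from the dataset summary
DATASET_BYTES = 326 * 10**9        # 326 GB on disk, assuming decimal gigabytes

# Hypothetical frame format, chosen only for illustration.
FRAME_HEIGHT, FRAME_WIDTH, CHANNELS = 64, 64, 3  # one byte per channel

# Average stored bytes per transition in the released dataset.
bytes_per_state = DATASET_BYTES / TOTAL_TRANSITIONS

# Uncompressed bytes for one pre-rendered frame in the assumed format.
bytes_per_frame = FRAME_HEIGHT * FRAME_WIDTH * CHANNELS

# Total storage if every transition carried such a frame, in terabytes.
frames_total_tb = TOTAL_TRANSITIONS * bytes_per_frame / 10**12

print(f"stored: ~{bytes_per_state:.0f} bytes per transition")
print(f"frames: {bytes_per_frame} bytes each, ~{frames_total_tb:.1f} TB in total")
```

Under these assumptions, uncompressed frames would take tens of terabytes, around two orders of magnitude more than the state-based storage.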

Easy Evaluations and White-Box Access

Since we store the raw environment state, we can evaluate agents online from any part of a trajectory, and evaluate how the trained agent performs on (a) the training levels; (b) unseen levels sampled from the same distribution; and (c) the hand-designed set of levels.

Usage

We've updated the main Kinetix repository to include ready-to-use data loaders, with a full example contained in examples/example_data_loading.py.

Downloading the Dataset

The dataset is hosted on Hugging Face. Download the entire dataset (~326 GB) or a single split:

    # Entire dataset (~326 GB)
    hf download mbeukman/Kinetix-Offline \
      --repo-type dataset \
      --local-dir ./data

    # Single split, e.g. 10M-step experts, medium size (~11 GB)
    hf download mbeukman/Kinetix-Offline \
      --repo-type dataset \
      --local-dir ./data \
      --include "10M/m/*"

Replace 10M/m with any {policy_steps}/{size} combination from the table above.
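The same download can also be scripted from Python. The sketch below uses `snapshot_download` from the `huggingface_hub` package; the helper names are ours, and it assumes the repository is laid out as `{policy_steps}/{size}/` directories, as the `--include` pattern above suggests. The list of available splits is taken from the table.

```python
# Splits listed in the table above; note there is no 10M/l split.
AVAILABLE_SPLITS = {
    ("1M", "s"), ("1M", "m"), ("1M", "l"),
    ("10M", "s"), ("10M", "m"),
}


def split_pattern(policy_steps: str, size: str) -> str:
    """Build the glob pattern for one split, e.g. '10M/m/*'."""
    if (policy_steps, size) not in AVAILABLE_SPLITS:
        raise ValueError(f"No split for {policy_steps}/{size}")
    return f"{policy_steps}/{size}/*"


def download_split(policy_steps: str, size: str, local_dir: str = "./data") -> str:
    """Download a single split and return the local directory path."""
    # Imported lazily so split_pattern works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="mbeukman/Kinetix-Offline",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=[split_pattern(policy_steps, size)],
    )


# Example (downloads roughly 11 GB, so it is left commented out):
# download_split("10M", "m")
```

Validating the split name before calling the Hub avoids silently downloading nothing when a pattern matches no files.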

Loading Data

TrajectoryDatasetManager loads complete episodes. Each batch has shape (batch_size, T, *dims), with the full env_state included so you can re-render observations on-the-fly in any observation modality.

    from kinetix.data import TrajectoryDatasetManager
    from kinetix.environment import EnvParams, static_env_params_from_size
    from kinetix.render import make_render_pixels
    import jax

    static_env_params = static_env_params_from_size("m")

    dataset = TrajectoryDatasetManager(
        dataset_dir="/path/to/traj_data",
        batch_size=256,
        val_batch_size=256,
        seed=0,
    )
    batch = dataset.load_next_batch()

    # batch.env_state holds the full simulator state at every timestep,
    # so you can render pixels at training time without storing raw frames:
    renderer = jax.jit(make_render_pixels(EnvParams(), static_env_params))
    frames = jax.vmap(jax.vmap(renderer))(batch.env_state)  # (B, T, H, W, C)

    # Other fields:
    # batch.action      (B, T, 6): discrete action for each of the 4 joints and 2 thrusters
    # batch.action_mask (B, T, 6): which joints/thrusters are active
    # batch.done        (B, T):    True at episode end

Training a BC agent

A full offline BC training script is provided in experiments/offline_bc.py. Run it as follows, or modify as needed.

    python experiments/offline_bc.py \
      dataset_dir=/path/to/data \
      learning.lr=3e-4 \
      env_size=m

Results

Behaviour cloning on this dataset is an effective way to bootstrap a strong generalist agent. After fewer than 24 GPU-hours of training, a BC agent already outperforms the from-scratch PPO baseline from our original paper. Although BC alone does not match the strongest recent results in Kinetix, it remains a compute-efficient way to warm-start a policy for online fine-tuning.

Solve rate on the randomly sampled m environments, across two different network architectures. The dotted red line represents performance after 1 trillion timesteps of from-scratch PPO.

Conclusion

We hope this dataset enables cost-effective research into large-scale offline learning and how task diversity influences downstream performance. The dataset is also general enough to support other uses, such as training world models and level generators. If you have questions or want to collaborate, please reach out!

Acknowledgements

This is based on the Distill Template and the ACCEL Blog. Compute for this work was provided by the Isambard-AI National AI Research Resource, under the project "FLAIR 2025 Moonshot Projects". Thank you to Alex Goldie and Jarek Liesen for useful discussions.