Introduction

Training generalist agents typically requires massive, diverse datasets, but most offline RL datasets cover at most tens of thousands of tasks. A notable example is XLand-100B, which contains full learning trajectories from 30,000 tasks. Here we release a dataset of 3 billion expert transitions across 11 million unique physics-based tasks, enabling offline RL at a scale not previously available. All tasks come from Kinetix, a procedurally generated 2D physics environment where the goal is to make the green and blue objects touch without green touching red. As a first demonstration of what this enables, we show that behaviour cloning followed by PPO fine-tuning reaches strong performance at a fraction of the compute required to train PPO from scratch.

Dataset

The dataset comes in five splits, varying by environment size (small/medium/large) and expert training budget (1M or 10M steps). The large size is only available with 1M-step experts.

Dataset Collection Process

We train a specialist RL agent on each level for a fixed number of timesteps (1M or 10M) and then collect a single trajectory per level. Not every level is solvable, so these experts do not always succeed. For the released dataset, we keep only the trajectories in which the expert solves the task. Roughly 50% of tasks are solvable.
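The per-level procedure above can be sketched as the loop below. It is a minimal illustration under stated assumptions, not the actual collection code. `train_expert`, `rollout`, `Trajectory` and its `solved` flag are hypothetical names. They stand in for specialist training, the single evaluation rollout, and the success check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass
class Trajectory:
    level_id: int
    actions: Sequence[int]
    solved: bool  # hypothetical flag: did the expert complete the task?


def collect_dataset(
    levels: Iterable[int],
    train_expert: Callable[[int, int], object],  # hypothetical: (level, steps) -> expert
    rollout: Callable[[object, int], Trajectory],  # hypothetical: (expert, level) -> episode
    train_steps: int,
) -> List[Trajectory]:
    """Train one specialist per level, roll it out once, keep only solved episodes."""
    kept = []
    for level in levels:
        expert = train_expert(level, train_steps)  # e.g. a 1M or 10M step budget
        traj = rollout(expert, level)  # a single trajectory per level
        if traj.solved:  # failed episodes (including unsolvable levels) are dropped
            kept.append(traj)
    return kept
```

Filtering happens after the rollout, so the expert's training budget is spent on every level, including the roughly half that end up discarded.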

Dataset Details

Here are the details of the different datasets we collected using this process.

Unique Levels is the number of distinct levels for which trajectories were collected; Transitions is the total number of individual environment steps across all trajectories.

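| Expert Training Steps | Size | Unique Levels | Transitions | Size on Disk |
|---|---|---|---|---|
| `1M` | `s` | 6M | 1.5B | 123 GB |
| `1M` | `m` | 3.5M | 884M | 98 GB |
| `1M` | `l` | 1M | 268M | 82 GB |
| `10M` | `s` | 637k | 163M | 12 GB |
| `10M` | `m` | 422k | 108M | 11 GB |
| Total | | 11M | 3B | 326 GB |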
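Each level contributes exactly one trajectory. Dividing Transitions by Unique Levels therefore gives the average episode length per split. The snippet below does this arithmetic with the rounded figures from the table, so the results are approximate.

```python
# Figures copied from the table: (unique levels, transitions, GB on disk).
splits = {
    ("1M", "s"): (6_000_000, 1_500_000_000, 123),
    ("1M", "m"): (3_500_000, 884_000_000, 98),
    ("1M", "l"): (1_000_000, 268_000_000, 82),
    ("10M", "s"): (637_000, 163_000_000, 12),
    ("10M", "m"): (422_000, 108_000_000, 11),
}

total_levels = sum(v[0] for v in splits.values())
total_transitions = sum(v[1] for v in splits.values())
total_gb = sum(v[2] for v in splits.values())

for (steps, size), (levels, transitions, gb) in splits.items():
    # One trajectory per level, so this is the mean episode length in the split.
    print(f"{steps}/{size}: {transitions / levels:.0f} steps per trajectory, {gb} GB")

print(f"total: {total_levels:,} levels, {total_transitions:,} transitions, {total_gb} GB")
```

Every split averages roughly 250 to 270 steps per trajectory. The per-split rows sum to about 11.6M levels, 2.92B transitions and exactly 326 GB.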

Why use this dataset?

Massive Task Diversity

With over 10M unique levels, this dataset makes it possible to study how offline agent performance scales with task diversity, and what challenges arise when training on millions of tasks.

Dynamic Rendering

We store raw environment state rather than pre-rendered frames, so the full 3B-transition dataset fits in 326 GB. The rendering function is specified at runtime, meaning the same data can train symbolic or pixel-based agents simply by swapping the renderer.
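The arithmetic below makes the storage saving concrete. It compares the average stored bytes per transition (from the table) with the cost of keeping one pre-rendered frame per transition. The assumptions are:

- the 64x64 RGB frame size is a hypothetical choice;
- GB is taken as 10^9 bytes;
- the on-disk size may include compression.

Treat the ratio as indicative only.

```python
# Average bytes per stored transition, from the table (state-based storage).
total_bytes = 326e9  # Total row, assuming GB = 10**9 bytes
total_transitions = 2.923e9  # sum of the per-split Transitions column
bytes_per_transition = total_bytes / total_transitions

# Hypothetical alternative: one uncompressed 64x64 RGB uint8 frame per transition.
frame_bytes = 64 * 64 * 3
pixel_dataset_tb = frame_bytes * total_transitions / 1e12

print(f"stored state: {bytes_per_transition:.0f} bytes per transition")
print(f"one 64x64 RGB frame: {frame_bytes} bytes")
print(f"pre-rendered frames alone: ~{pixel_dataset_tb:.0f} TB")
```

Even at this small resolution, frames alone would need about 36 TB. The stored state averages about 112 bytes per transition.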

Easy Evaluations and White-Box Access

Since we store the raw environment state, we can evaluate agents online starting from any point in a trajectory. We can also measure how a trained agent performs on (a) the training levels; (b) unseen levels sampled from the same distribution; and (c) the hand-designed set of levels.
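Starting an online evaluation from an arbitrary point only requires selecting the stored state at that timestep. The helpers below are a minimal sketch. They assume `env_state` is a JAX pytree whose leaves are laid out as (B, T, ...), as in the loader batches described under Usage. `states_at` and `state_of` are hypothetical names. Restoring the selected state into a Kinetix environment is not shown.

```python
import jax
import jax.numpy as jnp


def states_at(env_state, t):
    """Select the state at timestep t for every episode: leaves go (B, T, ...) -> (B, ...)."""
    return jax.tree_util.tree_map(lambda x: x[:, t], env_state)


def state_of(env_state, b, t):
    """Select a single unbatched state: episode b at timestep t."""
    return jax.tree_util.tree_map(lambda x: x[b, t], env_state)


# Toy example: a dict stands in for the real simulator state.
toy_state = {"pos": jnp.zeros((2, 5, 3)), "vel": jnp.arange(10.0).reshape(2, 5)}
starts = states_at(toy_state, 4)  # every leaf now has a leading batch dim of 2
```

Because `tree_map` applies the same indexing to every leaf, the helpers work unchanged on any nested state structure.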

Usage

We've updated the main Kinetix repository to include ready-to-use data loaders, with a full example contained in examples/example_data_loading.py.

Downloading the Dataset

The dataset is hosted on Hugging Face. Download the entire dataset (~326 GB) or a single split:

```bash
# Entire dataset (~326 GB)
hf download mbeukman/Kinetix-Offline \
  --repo-type dataset \
  --local-dir ./data

# Single split, e.g. 10M-step experts, medium size (~11 GB)
hf download mbeukman/Kinetix-Offline \
  --repo-type dataset \
  --local-dir ./data \
  --include "10M/m/*"
```

Replace 10M/m with any {policy_steps}/{size} combination from the table above.
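The same download can be scripted from Python. The sketch below assumes the `huggingface_hub` package is installed. It uses that package's `snapshot_download` function, with `allow_patterns` playing the role of `--include`. `split_pattern`, `download_split` and `AVAILABLE_SPLITS` are hypothetical helper names, and the split list is taken from the table.

```python
# The five (expert training steps, size) splits listed in the table.
AVAILABLE_SPLITS = {("1M", "s"), ("1M", "m"), ("1M", "l"), ("10M", "s"), ("10M", "m")}


def split_pattern(policy_steps: str, size: str) -> str:
    """Return the glob pattern for one split, e.g. '10M/m/*'."""
    if (policy_steps, size) not in AVAILABLE_SPLITS:
        raise ValueError(
            f"no split {policy_steps}/{size}; choose from {sorted(AVAILABLE_SPLITS)}"
        )
    return f"{policy_steps}/{size}/*"


def download_split(policy_steps: str, size: str, local_dir: str = "./data") -> str:
    """Download one split; returns the local folder path."""
    # Imported lazily so split_pattern can be used without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id="mbeukman/Kinetix-Offline",
        repo_type="dataset",
        local_dir=local_dir,
        allow_patterns=[split_pattern(policy_steps, size)],
    )
```

Validating the split name first turns a typo such as `10M/l`, which does not exist, into an immediate error. Without the check it would silently download nothing.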

Loading Data

TrajectoryDatasetManager loads complete episodes. Each batch has shape (batch_size, T, *dims), with the full env_state included so you can re-render observations on-the-fly in any observation modality.

```python
from kinetix.data import TrajectoryDatasetManager
from kinetix.environment import EnvParams, static_env_params_from_size
from kinetix.render import make_render_pixels
import jax

static_env_params = static_env_params_from_size("m")

dataset = TrajectoryDatasetManager(
    dataset_dir="/path/to/traj_data",
    batch_size=256,
    val_batch_size=256,
    seed=0,
)
batch = dataset.load_next_batch()

# batch.env_state holds the full simulator state at every timestep,
# so you can render pixels at training time without storing raw frames:
renderer = jax.jit(make_render_pixels(EnvParams(), static_env_params))
frames = jax.vmap(jax.vmap(renderer))(batch.env_state)  # (B, T, H, W, C)

# other fields
# batch.action       (B, T, 6): discrete action for each of the 4 joints and 2 thrusters
# batch.action_mask  (B, T, 6): which joints/thrusters are active
# batch.done         (B, T):    True at episode end
```
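Suppose episodes shorter than T are padded after their final step. That is an assumption, so check the loader. A per-step validity mask can then be derived from `batch.done`, so that padded steps are excluded from losses. Below is a minimal sketch, with `valid_step_mask` as a hypothetical helper.

```python
import jax.numpy as jnp


def valid_step_mask(done: jnp.ndarray) -> jnp.ndarray:
    """Return a (B, T) bool mask that is True up to and including the first done step.

    Assumes (hypothetically) that anything after the first done=True is padding.
    """
    done = done.astype(jnp.int32)
    # Number of episode ends that occurred strictly before each step.
    ends_before = jnp.cumsum(done, axis=1) - done
    return ends_before == 0


# Toy example: episode 0 ends at t=2 and is padded at t=3; episode 1 fills all of T.
toy_done = jnp.array([[False, False, True, False], [False, False, False, False]])
toy_mask = valid_step_mask(toy_done)
```

If the loader already returns unpadded episodes, the mask is simply all True, so applying it is harmless.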

Training a BC agent

A full offline BC training script is provided in experiments/offline_bc.py. Run it as follows, or modify as needed.

```bash
python experiments/offline_bc.py \
  dataset_dir=/path/to/data \
  learning.lr=3e-4 \
  env_size=m
```
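At its core, offline BC minimises the negative log-likelihood of the expert's actions. The sketch below is a minimal masked version for the 6-dimensional discrete action described above. It is an illustration, not necessarily what `experiments/offline_bc.py` implements. `bc_loss` is a hypothetical name, and the number of choices per head, A, is left to the caller.

```python
import jax
import jax.numpy as jnp


def bc_loss(logits, actions, action_mask, step_mask):
    """Masked cross-entropy for a multi-discrete policy.

    logits:      (B, T, 6, A) unnormalised scores, one head per joint/thruster
    actions:     (B, T, 6)    integer expert actions
    action_mask: (B, T, 6)    1 where the joint/thruster is active
    step_mask:   (B, T)       1 for real (non-padded) timesteps
    """
    log_probs = jax.nn.log_softmax(logits, axis=-1)
    # Log-probability the policy assigns to the expert's action in each head.
    chosen = jnp.take_along_axis(log_probs, actions[..., None], axis=-1)[..., 0]
    weights = action_mask.astype(jnp.float32) * step_mask[..., None].astype(jnp.float32)
    # Average over active heads on valid steps; guard against an all-zero mask.
    return -(chosen * weights).sum() / jnp.maximum(weights.sum(), 1.0)
```

Averaging only over active heads keeps inactive joints and thrusters out of the gradient, since their actions presumably have no effect.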

TL;DR