Evaluating a trained world model

This page covers standalone metric runs, the two sampling modes, the metric definitions, and the numbers we report on each domain. The standardized evaluation uses 256 fixed validation samples (seed 42), 250 DDIM steps, and sequential scheduling. This page is the HTML rendering of docs/evaluation.md.

Quick start

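The experiment=evaluate_only preset sets tasks=[evaluate] and validation_size: null, which evaluates on the full validation set unless a fixed subset is requested. The command below restricts evaluation to a fixed 256-sample subset: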

python src/main.py experiment=evaluate_only \
    dataset=dino_wm/pusht model=nanowm_b2 \
    resume_from_checkpoint=<path/to/ckpt> \
    dataset.loader.validation_fixed_subset_size=256 \
    dataset.loader.validation_fixed_subset_seed=42

Outputs under ${RESULTS_DIR}/<run_dir>/: eval_videos/ (GT-vs-prediction MP4s), metrics.json (PSNR / SSIM / LPIPS / FID, FVD if enough samples), and the composed config in .hydra/.
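
To inspect the scores after a run, metrics.json can be read with a few lines of Python. This is a minimal sketch, not part of the repository: load_metrics is a hypothetical helper, and it assumes metrics.json is a flat JSON object whose numeric values are the scores. The key names are not specified on this page, so the sketch does not rely on them.

```python
import json
from pathlib import Path


def load_metrics(run_dir):
    """Return the numeric top-level entries of <run_dir>/metrics.json.

    Assumption: the file holds a flat JSON object (name -> value).
    """
    with open(Path(run_dir) / "metrics.json") as f:
        data = json.load(f)
    # Keep numbers only; bool is excluded because it subclasses int.
    return {
        name: float(value)
        for name, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
```

Calling load_metrics on a run directory returns a plain dict that can be printed or collected across runs. If the real file is nested, the comprehension would need adapting.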

If you already have rollout videos, compute metrics on them directly, then plot comparisons across runs (for example, one CSV per history length):

python src/sample/evaluate_metrics.py \
    --video_dir /path/to/rollout_results --history_length 1 --output_csv metrics.csv

python src/sample/plot_metrics.py \
    --csvs metrics_h1.csv metrics_h2.csv metrics_h3.csv --output rollout_comparison.png

Sampling modes

`model.scheduling_mode`	Behavior	Steps
`sequential` (default)	Frame-by-frame autoregressive denoising. Highest quality.	250 (default)
`full_sequence`	Denoise all frames jointly (DDIM over the whole clip). Faster, slightly lower quality.	50 is sensible

python src/main.py experiment=evaluate_only ... \
    model.scheduling_mode=full_sequence model.num_sampling_steps=50

Metric definitions

PSNR, SSIM, and LPIPS are computed per clip and averaged. FID compares feature distributions, so it is computed over the evaluation set as a whole.

Metric	Direction	Notes
PSNR	↑	per-pixel MSE, in dB
SSIM	↑	structural similarity, [0, 1]
LPIPS	↓	learned perceptual distance (AlexNet)
FID	↓	Fréchet Inception Distance via i3d torchscript
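
As a worked example of the PSNR row, the sketch below applies the standard definition, PSNR = 10 · log10(MAX² / MSE), once per clip and then averages over clips. It is an illustration in NumPy and not the repository's implementation: the function names, the [0, 1] pixel range, and taking the MSE over the whole clip rather than per frame are assumptions.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two same-shaped arrays with values in [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs have unbounded PSNR
    return float(10.0 * np.log10(max_val ** 2 / mse))


def mean_clip_psnr(pred_clips, target_clips, max_val=1.0):
    """Compute PSNR once per clip, then average the per-clip scores."""
    scores = [psnr(p, t, max_val) for p, t in zip(pred_clips, target_clips)]
    return float(np.mean(scores))
```

A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01 and therefore a PSNR of 20 dB, which is a convenient sanity check.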

For longer-horizon videos with enough samples, FVD is also computed. The i3d model is loaded from ${PRETRAINED_MODELS_DIR}/i3d/i3d_torchscript.pt, falling back to the relative path pretrained_models/i3d/i3d_torchscript.pt.
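
The lookup order for the i3d file can be sketched as follows. This is a hedged illustration, not the repository's code: resolve_i3d_path is a hypothetical name, and it assumes the fallback applies when PRETRAINED_MODELS_DIR is unset or empty. The actual code may instead fall back when the file is missing.

```python
import os
from pathlib import Path

I3D_RELATIVE = Path("i3d") / "i3d_torchscript.pt"


def resolve_i3d_path(env=None):
    """Return the i3d checkpoint path, preferring PRETRAINED_MODELS_DIR."""
    env = os.environ if env is None else env
    root = env.get("PRETRAINED_MODELS_DIR")
    if root:
        return Path(root) / I3D_RELATIVE
    # Assumed fallback: a path relative to the current working directory.
    return Path("pretrained_models") / I3D_RELATIVE
```

Passing a dict as env makes the behavior easy to check without touching the real environment.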

Results on shipped checkpoints

256 fixed val samples (seed 42), 250 DDIM steps, sequential scheduling.

DINO-WM env (NanoWM-B/2)	Steps	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓
Point Maze	30k	36.74	0.984	0.019	9.66
Wall	15k	34.05	0.994	0.010	2.64
PushT	100k	33.19	0.982	0.016	13.63
Rope	15k	31.63	0.953	0.056	35.20
Granular	15k	26.08	0.917	0.073	40.05

RT-1 (fractal)	Steps	PSNR ↑	SSIM ↑	LPIPS ↓	FID ↓
NanoWM-B/2	300k	24.36	0.787	0.180	35.08

Reproducing these numbers

Each row is reproduced by a single eval command. Replace <ckpt> with the path to the downloaded Hugging Face checkpoint and <dataset> with the matching entry: one of dino_wm/{point_maze,wall,pusht,rope,granular}, or rt1/rt1.

python src/main.py experiment=evaluate_only \
    dataset=<dataset> model=nanowm_b2 \
    resume_from_checkpoint=<ckpt> \
    dataset.loader.validation_fixed_subset_size=256 \
    dataset.loader.validation_fixed_subset_seed=42
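
To reproduce all six rows in one pass, the command above can be generated per dataset. The sketch below only builds the argument lists and does not launch anything. eval_command and the DATASETS list are illustrative names, not part of the repository, and checkpoint paths are left for the caller to supply.

```python
import shlex

# Dataset identifiers as listed above.
DATASETS = [
    "dino_wm/point_maze",
    "dino_wm/wall",
    "dino_wm/pusht",
    "dino_wm/rope",
    "dino_wm/granular",
    "rt1/rt1",
]


def eval_command(dataset, ckpt):
    """Build the argv list for one standardized eval run."""
    return [
        "python", "src/main.py", "experiment=evaluate_only",
        f"dataset={dataset}", "model=nanowm_b2",
        f"resume_from_checkpoint={ckpt}",
        "dataset.loader.validation_fixed_subset_size=256",
        "dataset.loader.validation_fixed_subset_seed=42",
    ]


def eval_command_line(dataset, ckpt):
    """Same command as a single shell-quoted string, for printing or logging."""
    return shlex.join(eval_command(dataset, ckpt))
```

The list form can be handed to subprocess.run as-is, and the string form is convenient for pasting into a job script.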

Per-axis ablation numbers live in the training spec. Success-rate (planning) evaluation and long-horizon rollout are covered by docs/applications/planning.md and docs/applications/long_rollout.md.