OmniWM Spec · evaluation
Report — protocol + headline numbers

Evaluating a trained world model

Standalone metric runs, the two sampling modes, metric definitions, and the numbers we report on each domain. Standardized eval: 256 fixed val samples (seed 42), 250 DDIM steps, sequential scheduling. HTML rendering of docs/evaluation.md.

00

Quick start

experiment=evaluate_only sets tasks=[evaluate] and validation_size: null, which evaluates the full validation set unless a fixed subset is requested, as in this command:

python src/main.py experiment=evaluate_only \
    dataset=dino_wm/pusht model=nanowm_b2 \
    resume_from_checkpoint=<path/to/ckpt> \
    dataset.loader.validation_fixed_subset_size=256 \
    dataset.loader.validation_fixed_subset_seed=42
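
The fixed subset keeps runs comparable: the same seed selects the same validation clips every time. A minimal sketch of that idea, assuming a seeded draw without replacement. `fixed_subset_indices` is a hypothetical helper for illustration, not the loader's actual code, and the loader may pick its indices differently.

```python
import numpy as np

def fixed_subset_indices(dataset_size: int, subset_size: int, seed: int) -> np.ndarray:
    """Hypothetical helper: pick a reproducible subset of validation indices."""
    if subset_size > dataset_size:
        raise ValueError("subset_size exceeds dataset_size")
    rng = np.random.default_rng(seed)  # seeded generator: same seed, same draw
    # Sample without replacement, then sort for a stable iteration order.
    return np.sort(rng.choice(dataset_size, size=subset_size, replace=False))

a = fixed_subset_indices(10_000, 256, seed=42)
b = fixed_subset_indices(10_000, 256, seed=42)
assert np.array_equal(a, b)  # identical selection across runs
```

Changing either the seed or the subset size changes the selection, so both need to match when comparing numbers across checkpoints.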

Outputs under ${RESULTS_DIR}/<run_dir>/: eval_videos/ (GT-vs-prediction MP4s), metrics.json (PSNR / SSIM / LPIPS / FID, FVD if enough samples), and the composed config in .hydra/.
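
To pull the headline numbers out of a finished run, something like the following may be enough. It is a sketch that assumes metrics.json is a flat JSON object mapping metric names to numbers; the actual schema is not specified here, so non-numeric entries are skipped rather than interpreted. Both function names are hypothetical.

```python
import json
from pathlib import Path

def load_scalar_metrics(path):
    """Return the numeric top-level entries of a metrics JSON file."""
    with Path(path).open() as f:
        data = json.load(f)
    # Keep plain numbers only; bool is excluded because it subclasses int.
    return {
        key: float(value)
        for key, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }

def format_metrics(metrics):
    """Render metrics as a single 'name=value' line, sorted by name."""
    return "  ".join(f"{name}={value:.3f}" for name, value in sorted(metrics.items()))

# Usage (path is a placeholder):
# print(format_metrics(load_scalar_metrics("<run_dir>/metrics.json")))
```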

If you already have rollout videos, compute metrics directly and plot comparisons:

python src/sample/evaluate_metrics.py \
    --video_dir /path/to/rollout_results --history_length 1 --output_csv metrics.csv

python src/sample/plot_metrics.py \
    --csvs metrics_h1.csv metrics_h2.csv metrics_h3.csv --output rollout_comparison.png

01

Sampling modes

| model.scheduling_mode | Behavior | Steps |
| --- | --- | --- |
| sequential (default) | Frame-by-frame autoregressive denoising. Highest quality. | 250 (default) |
| full_sequence | Denoise all frames jointly (DDIM over the whole clip). Faster, slightly lower quality. | 50 is sensible |

python src/main.py experiment=evaluate_only ... \
    model.scheduling_mode=full_sequence model.num_sampling_steps=50
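
The difference between the two modes is easiest to see as loop structure. The toy sketch below is not the repository's scheduler: it assumes that sequential mode runs every denoising step for one frame before moving to the next, and that full_sequence mode updates all frames together at each step. Both function names are hypothetical.

```python
def sequential_order(num_frames, num_steps):
    """Toy order: finish all denoising steps for one frame before the next."""
    return [(frame, step) for frame in range(num_frames) for step in range(num_steps)]

def full_sequence_order(num_frames, num_steps):
    """Toy order: each step updates every frame of the clip together."""
    all_frames = tuple(range(num_frames))
    return [(all_frames, step) for step in range(num_steps)]

# Under the assumption of one denoiser call per entry, an 8-frame clip needs:
print(len(sequential_order(8, 250)))    # 8 * 250 entries
print(len(full_sequence_order(8, 50)))  # 50 entries
```

Under these assumptions the speed gap comes from two sources: fewer steps, and one joint pass over the clip instead of one pass per frame.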

02

Metric definitions

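PSNR, SSIM, and LPIPS are computed per clip and averaged over the evaluation set. FID is a set-level statistic: it compares feature distributions of the predicted and ground-truth samples as a whole.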

| Metric | Direction | Notes |
| --- | --- | --- |
| PSNR | ↑ | per-pixel MSE, in dB |
| SSIM | ↑ | structural similarity, [0, 1] |
| LPIPS | ↓ | learned perceptual distance (AlexNet) |
| FID | ↓ | Fréchet Inception Distance via i3d torchscript |

For longer-horizon videos with enough samples, FVD is also computed. The i3d model resolves from ${PRETRAINED_MODELS_DIR}/i3d/i3d_torchscript.pt with a relative fallback pretrained_models/i3d/i3d_torchscript.pt.
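
For reference, the two closed-form pieces behind these metrics can be sketched in NumPy. This uses the textbook definitions, not the repository's implementation: the pixel range (`max_val`), the feature extractor, and how clips are aggregated are assumptions here. PSNR is 10·log10(max_val² / MSE), and FID and FVD are both Fréchet distances between Gaussians fitted to feature vectors.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """PSNR in dB between two arrays with values in [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_val ** 2 / mse))

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two (N, D) feature sets."""
    feats_a = np.asarray(feats_a, dtype=np.float64)
    feats_b = np.asarray(feats_b, dtype=np.float64)
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(feats_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(feats_b, rowvar=False))
    # Tr((cov_a cov_b)^(1/2)) equals the sum of square roots of the eigenvalues
    # of cov_a @ cov_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)
```

The Fréchet estimate needs many more samples than feature dimensions to give a stable covariance, which is one reason FVD is only reported when enough samples are available.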

03

Results on shipped checkpoints

256 fixed val samples (seed 42), 250 DDIM steps, sequential scheduling.

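| DINO-WM env (NanoWM-B/2) | Steps | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ |
| --- | --- | --- | --- | --- | --- |
| Point Maze | 30k | 36.74 | 0.984 | 0.019 | 9.66 |
| Wall | 15k | 34.05 | 0.994 | 0.010 | 2.64 |
| PushT | 100k | 33.19 | 0.982 | 0.016 | 13.63 |
| Rope | 15k | 31.63 | 0.953 | 0.056 | 35.20 |
| Granular | 15k | 26.08 | 0.917 | 0.073 | 40.05 |

| RT-1 (fractal) | Steps | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ |
| --- | --- | --- | --- | --- | --- |
| NanoWM-B/2 | 300k | 24.36 | 0.787 | 0.180 | 35.08 |

04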

Reproducing these numbers

Each row is a single eval command. Replace <ckpt> with the downloaded HF checkpoint and <dataset> with the matching dino_wm/{point_maze,wall,pusht,rope,granular} or rt1/rt1:

python src/main.py experiment=evaluate_only \
    dataset=<dataset> model=nanowm_b2 \
    resume_from_checkpoint=<ckpt> \
    dataset.loader.validation_fixed_subset_size=256 \
    dataset.loader.validation_fixed_subset_seed=42
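
To run every row in one go, the command can be generated per dataset. A sketch under stated assumptions: the dataset names and overrides are taken from the command above, while `build_eval_command` and the `checkpoints/<name>.ckpt` layout are hypothetical and should be replaced with wherever the downloaded checkpoints actually live.

```python
import shlex

DATASETS = [
    "dino_wm/point_maze", "dino_wm/wall", "dino_wm/pusht",
    "dino_wm/rope", "dino_wm/granular", "rt1/rt1",
]

def build_eval_command(dataset, ckpt):
    """Return the standardized eval command as an argument list."""
    return [
        "python", "src/main.py", "experiment=evaluate_only",
        f"dataset={dataset}", "model=nanowm_b2",
        f"resume_from_checkpoint={ckpt}",
        "dataset.loader.validation_fixed_subset_size=256",
        "dataset.loader.validation_fixed_subset_seed=42",
    ]

# Hypothetical checkpoint layout: one file per dataset under ./checkpoints.
for dataset in DATASETS:
    ckpt = f"checkpoints/{dataset.replace('/', '_')}.ckpt"
    print(shlex.join(build_eval_command(dataset, ckpt)))
```

This only prints the commands. To execute them, each argument list can be passed to subprocess.run(cmd, check=True) from the repository root.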

Per-axis ablation numbers live in the training spec. Success-rate (planning) evaluation and long-horizon rollout are covered by docs/applications/planning.md and docs/applications/long_rollout.md.