Standalone metric runs, the two sampling modes, metric definitions, and the numbers we
report on each domain. Standardized eval: 256 fixed val samples (seed 42), 250 DDIM
steps, sequential scheduling. HTML rendering of docs/evaluation.md.
experiment=evaluate_only sets tasks=[evaluate] and
validation_size: null (the full val set, optionally restricted to a fixed subset):
python src/main.py experiment=evaluate_only \
dataset=dino_wm/pusht model=nanowm_b2 \
resume_from_checkpoint=<path/to/ckpt> \
dataset.loader.validation_fixed_subset_size=256 \
dataset.loader.validation_fixed_subset_seed=42
Outputs under ${RESULTS_DIR}/<run_dir>/: eval_videos/
(GT-vs-prediction MP4s), metrics.json (PSNR / SSIM / LPIPS / FID, FVD if enough
samples), and the composed config in .hydra/.
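For a quick look at a finished run, the metrics file can be read back with a few lines of Python. This is a minimal sketch that assumes metrics.json is a flat JSON object mapping metric names to numbers; the actual schema is not documented here, so the helper names (load_scalar_metrics, format_metrics) are hypothetical and no particular key names are assumed.

```python
import json
from pathlib import Path


def load_scalar_metrics(run_dir):
    """Return the numeric top-level entries of <run_dir>/metrics.json.

    Assumes the file holds a flat JSON object (an assumption, not a
    documented schema). Non-numeric entries are skipped.
    """
    with open(Path(run_dir) / "metrics.json") as f:
        data = json.load(f)
    # Keep plain numbers only; bool is excluded because it subclasses int.
    return {
        key: float(value)
        for key, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


def format_metrics(metrics):
    """Render one 'name: value' line per metric, sorted by name."""
    return "\n".join(f"{k}: {v:.4f}" for k, v in sorted(metrics.items()))
```

Calling print(format_metrics(load_scalar_metrics(run_dir))) on a run directory lists whatever scalar metrics the file contains.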
If you already have rollout videos, compute metrics directly (one CSV per run, for example one per --history_length setting) and plot comparisons:
python src/sample/evaluate_metrics.py \
--video_dir /path/to/rollout_results --history_length 1 --output_csv metrics.csv
python src/sample/plot_metrics.py \
--csvs metrics_h1.csv metrics_h2.csv metrics_h3.csv --output rollout_comparison.png
| model.scheduling_mode | Behavior | Steps |
|---|---|---|
| sequential (default) | Frame-by-frame autoregressive denoising. Highest quality. | 250 (default) |
| full_sequence | Denoise all frames jointly (DDIM over the whole clip). Faster, slightly lower quality. | 50 is sensible |
python src/main.py experiment=evaluate_only ... \
model.scheduling_mode=full_sequence model.num_sampling_steps=50
All four metrics are computed per-clip and averaged.
| Metric | Direction | Notes |
|---|---|---|
| PSNR | ↑ | computed from per-pixel MSE, in dB |
| SSIM | ↑ | structural similarity, [0, 1] |
| LPIPS | ↓ | learned perceptual distance (AlexNet) |
| FID | ↓ | Fréchet Inception Distance via i3d torchscript |
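To make the PSNR row and the "per-clip, then averaged" convention concrete, here is a minimal NumPy sketch. It assumes pixel values scaled to [0, 1] and one PSNR value per clip computed over all of that clip's frames; the repository's own implementation may differ in value range or in how frames are pooled, and the function names are illustrative.

```python
import numpy as np


def psnr(pred, target, max_val=1.0):
    """PSNR in dB: 10 * log10(max_val^2 / MSE), with MSE taken per pixel."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")  # identical inputs
    return float(10.0 * np.log10(max_val ** 2 / mse))


def mean_clip_psnr(pred_clips, target_clips, max_val=1.0):
    """Compute PSNR once per clip, then average the per-clip values."""
    scores = [psnr(p, t, max_val) for p, t in zip(pred_clips, target_clips)]
    return float(np.mean(scores))
```

Averaging per-clip PSNR values is not the same as computing PSNR from the MSE pooled over all clips, which is why the order of operations is worth pinning down when comparing numbers across codebases.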
For longer-horizon videos with enough samples, FVD is also computed.
The i3d model is loaded from ${PRETRAINED_MODELS_DIR}/i3d/i3d_torchscript.pt, falling
back to the relative path pretrained_models/i3d/i3d_torchscript.pt.
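The lookup order described above can be sketched as follows. This is not the repository's code: the function name is hypothetical, and it assumes the fallback applies both when PRETRAINED_MODELS_DIR is unset and when the file is missing under it, which the text does not spell out.

```python
import os
from pathlib import Path


def resolve_i3d_path(env=None):
    """Return the preferred i3d checkpoint path, or the relative fallback."""
    env = os.environ if env is None else env
    rel = Path("i3d") / "i3d_torchscript.pt"
    root = env.get("PRETRAINED_MODELS_DIR")
    if root:
        candidate = Path(root) / rel
        if candidate.is_file():
            return candidate
    # Assumed fallback: relative to the current working directory.
    return Path("pretrained_models") / rel
```

Because the fallback is relative, it only resolves when the command is launched from the directory that contains pretrained_models/.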
256 fixed val samples (seed 42), 250 DDIM steps, sequential scheduling.
| DINO-WM env (NanoWM-B/2) | Training steps | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ |
|---|---|---|---|---|---|
| Point Maze | 30k | 36.74 | 0.984 | 0.019 | 9.66 |
| Wall | 15k | 34.05 | 0.994 | 0.010 | 2.64 |
| PushT | 100k | 33.19 | 0.982 | 0.016 | 13.63 |
| Rope | 15k | 31.63 | 0.953 | 0.056 | 35.20 |
| Granular | 15k | 26.08 | 0.917 | 0.073 | 40.05 |
| RT-1 (fractal) | Training steps | PSNR ↑ | SSIM ↑ | LPIPS ↓ | FID ↓ |
|---|---|---|---|---|---|
| NanoWM-B/2 | 300k | 24.36 | 0.787 | 0.180 | 35.08 |
Each row is reproduced by a single eval command. Replace <ckpt> with the downloaded
Hugging Face checkpoint and <dataset> with the matching
dino_wm/{point_maze,wall,pusht,rope,granular} or rt1/rt1:
python src/main.py experiment=evaluate_only \
dataset=<dataset> model=nanowm_b2 \
resume_from_checkpoint=<ckpt> \
dataset.loader.validation_fixed_subset_size=256 \
dataset.loader.validation_fixed_subset_seed=42
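When reproducing several rows, it can help to generate the commands rather than edit them by hand. The sketch below only assembles the command shown above for each dataset name; the checkpoints mapping (dataset name to local checkpoint path) is a hypothetical input you supply, and nothing here is part of the repository.

```python
import shlex

# Dataset names as listed above.
DATASETS = [
    "dino_wm/point_maze",
    "dino_wm/wall",
    "dino_wm/pusht",
    "dino_wm/rope",
    "dino_wm/granular",
    "rt1/rt1",
]


def eval_command(dataset, ckpt, subset_size=256, subset_seed=42):
    """Build the argv list for one standardized evaluation run."""
    return [
        "python", "src/main.py", "experiment=evaluate_only",
        f"dataset={dataset}", "model=nanowm_b2",
        f"resume_from_checkpoint={ckpt}",
        f"dataset.loader.validation_fixed_subset_size={subset_size}",
        f"dataset.loader.validation_fixed_subset_seed={subset_seed}",
    ]


def all_commands(checkpoints):
    """Return one shell-quoted command line per dataset with a checkpoint."""
    return [
        shlex.join(eval_command(name, checkpoints[name]))
        for name in DATASETS
        if name in checkpoints
    ]
```

The returned lines can be printed for review or passed to a job launcher; each argv list from eval_command can also be handed to subprocess.run directly.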
Per-axis ablation numbers live in the
training spec. Success-rate (planning) evaluation and
long-horizon rollout are covered by docs/applications/planning.md and
docs/applications/long_rollout.md.