Trimming the Long-Tail of Visual World Modeling Evaluation

1 University of Illinois Urbana-Champaign      2 Stanford University      3 Columbia University

Introduction

Physical interactions follow a long-tailed distribution: a set of common and regular physical interactions dominates human experience and visual data, whereas a broad spectrum of rare and irregular interactions demands reasoning over fundamental object attributes. Although recent world models (e.g., image and video generation models) achieve impressive realism on existing benchmarks, they primarily focus on simulating common, in-distribution physical interactions.

This raises a central question: Do visual world models internalize and generalize the physical principles?

To answer this question, we introduce TAILOR, a benchmark that challenges world models with irregular physical interactions. We evaluate scenarios under two complementary settings—Predictive generation and Descriptive generation—and reveal a pronounced long-tail generalization gap. Our results suggest that current world models exhibit limited understanding of physical principles and struggle with attribute-level generalization under distribution shift.

TAILOR Benchmark

TAILOR Benchmark Overview

TAILOR challenges world models to simulate long-tail scenarios that require reasoning about object attributes.

Regular Scenarios

Reflects common tool-task pairs. Tests whether models can reproduce highly frequent interactions observed in training data.

Unconventional Scenarios

Replaces canonical tools with attribute-compatible substitutes. Tests affordance generalization beyond surface-level associations.

Impossible Scenarios

Introduces attribute-violating tools. Probes constraint awareness and whether models recognize when an interaction should fail.

Two Evaluation Settings

Predictive Generation

We provide a prompt describing the initial setup only (e.g., "A person tries to open a wine bottle with a screwdriver"). The model must infer the physical outcome.

  • Tests: Internalized physical priors and causal "common sense."
  • Insight: Does the model "know" what happens next without being told?

Descriptive Generation

The prompt explicitly specifies the outcome (e.g., "...the screwdriver fails to remove the cork"). This removes the need for physical "guessing."

  • Tests: Fine-grained instruction following and attribute binding.
  • Insight: Can the model override its training biases when told to do so?

Quantitative Results

Automatic and human evaluation results across TAILOR's three scenarios.

Prompt Setting

Best Auto
Best Human
Score View:

Image Generation Models

Model Regular Unconventional Impossible
IA IntAcc Phys Perc IA IntAcc Phys Perc IA IntAcc Phys Perc

Video Generation Models

Model Regular Unconventional Impossible
IA IntAcc Phys Perc IA IntAcc Phys Perc IA IntAcc Phys Perc

Failure Mode Gallery

We organize failure cases by scenario type and modality to expose where current world models break down. Each view summarizes the dominant failure modes observed in the benchmark and pairs them with descriptive and predictive examples.

Scenario

Modality

Prompt

Citation

                    @article{li2026trimming,
                    title={Trimming the Long-Tail of Visual World Modeling Evaluation},
                    author={Li, Bingxuan and Hong, Yining and Qian, Cheng and Ha, Hyeonjeong and Liu, Jiateng and Wang, Zhenhailong and Guo,
                    Yue and Li, Yunzhu and Ji, Heng},
                    journal={arXiv preprint arXiv:2606.24256},
                    year={2026}
                    }