Trimming the Long-Tail of Visual World Modeling Evaluation

Bingxuan Li¹ Yining Hong² Cheng Qian¹ Hyeonjeong Ha¹ Jiateng Liu¹ Zhenhailong Wang¹ Yue Guo¹ Yunzhu Li³ Heng Ji¹

¹ University of Illinois Urbana-Champaign ² Stanford University ³ Columbia University

Introduction

Physical interactions follow a long-tailed distribution: a set of common and regular physical interactions dominates human experience and visual data, whereas a broad spectrum of rare and irregular interactions demands reasoning over fundamental object attributes. Although recent world models (e.g., image and video generation models) achieve impressive realism on existing benchmarks, they primarily focus on simulating common, in-distribution physical interactions.

This raises a central question: do visual world models internalize and generalize physical principles?

To answer this question, we introduce TAILOR, a benchmark that challenges world models with irregular physical interactions. We evaluate each scenario under two complementary settings, Predictive Generation and Descriptive Generation, and reveal a pronounced long-tail generalization gap. Our results suggest that current world models exhibit limited understanding of physical principles and struggle with attribute-level generalization under distribution shift.

TAILOR Benchmark

TAILOR challenges world models to simulate long-tail scenarios that require reasoning about object attributes.

Regular Scenarios

Reflects common tool-task pairs. Tests whether models can reproduce highly frequent interactions observed in training data.

Unconventional Scenarios

Replaces canonical tools with attribute-compatible substitutes. Tests affordance generalization beyond surface-level associations.

Impossible Scenarios

Introduces attribute-violating tools. Probes constraint awareness and whether models recognize when an interaction should fail.
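
The three scenario types above can be read as a simple rule over tool attributes. The following is a minimal sketch of that reading, not the benchmark's actual construction pipeline: the `Task` and `Tool` classes, the `classify` function, and the example task, tools, and attribute names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Scenario(Enum):
    REGULAR = "regular"
    UNCONVENTIONAL = "unconventional"
    IMPOSSIBLE = "impossible"


@dataclass(frozen=True)
class Tool:
    name: str
    attributes: FrozenSet[str]


@dataclass(frozen=True)
class Task:
    name: str
    canonical_tool: str                  # the tool commonly paired with this task
    required_attributes: FrozenSet[str]  # attributes any tool needs to succeed


def classify(task: Task, tool: Tool) -> Scenario:
    """Assign a tool-task pair to one of the three scenario types."""
    if tool.name == task.canonical_tool:
        return Scenario.REGULAR
    # A substitute tool is attribute-compatible if it has every required attribute.
    if task.required_attributes <= tool.attributes:
        return Scenario.UNCONVENTIONAL
    # Otherwise the tool violates at least one required attribute.
    return Scenario.IMPOSSIBLE


# Hypothetical task, tools, and attributes (assumptions, not TAILOR data).
cut_paper = Task("cut paper", "scissors", frozenset({"sharp edge"}))
scissors = Tool("scissors", frozenset({"sharp edge", "hinged"}))
knife = Tool("knife", frozenset({"sharp edge", "rigid"}))
spoon = Tool("spoon", frozenset({"rigid", "concave"}))

for tool in (scissors, knife, spoon):
    print(tool.name, "->", classify(cut_paper, tool).value)
```

Under these assumed attributes, the canonical tool maps to the regular scenario, a substitute that shares the required attribute maps to the unconventional scenario, and a tool lacking it maps to the impossible scenario.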

Two Evaluation Settings

Predictive Generation

We provide a prompt describing the initial setup only (e.g., "A person tries to open a wine bottle with a screwdriver"). The model must infer the physical outcome.

Tests: Internalized physical priors and causal "common sense."
Insight: Does the model "know" what happens next without being told?

Descriptive Generation

The prompt explicitly specifies the outcome (e.g., "...the screwdriver fails to remove the cork"). This removes the need for physical "guessing."

Tests: Fine-grained instruction following and attribute binding.
Insight: Can the model override its training biases when told to do so?
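
The difference between the two settings comes down to whether the outcome is included in the prompt. A minimal sketch of that difference follows, reusing the wine bottle example quoted above. The `Setting` enum, the `build_prompt` function, and the way the two sentences are joined are assumptions for illustration; the benchmark's actual prompt templates may differ.

```python
from enum import Enum
from typing import Optional


class Setting(Enum):
    PREDICTIVE = "predictive"
    DESCRIPTIVE = "descriptive"


def build_prompt(setup: str, outcome: Optional[str], setting: Setting) -> str:
    """Build a generation prompt for one setting (hypothetical template)."""
    setup = setup.strip().rstrip(".")
    if setting is Setting.PREDICTIVE:
        # Initial setup only: the model must infer the physical outcome.
        return setup + "."
    if not outcome or not outcome.strip():
        raise ValueError("descriptive prompts require an explicit outcome")
    # Descriptive: append the explicitly specified outcome as a second sentence.
    outcome = outcome.strip().rstrip(".")
    return f"{setup}. {outcome[0].upper()}{outcome[1:]}."


setup = "A person tries to open a wine bottle with a screwdriver"
outcome = "the screwdriver fails to remove the cork"

predictive = build_prompt(setup, None, Setting.PREDICTIVE)
descriptive = build_prompt(setup, outcome, Setting.DESCRIPTIVE)
print(predictive)
print(descriptive)
```

Keeping the setup text identical across both settings means any difference in the generated output can be attributed to the presence of the stated outcome rather than to a change in scene description.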

Quantitative Results

Automatic and human evaluation results across TAILOR's three scenarios.

Image Generation Models

Model	Regular				Unconventional				Impossible
	IA	IntAcc	Phys	Perc	IA	IntAcc	Phys	Perc	IA	IntAcc	Phys	Perc

Video Generation Models

Model	Regular				Unconventional				Impossible
	IA	IntAcc	Phys	Perc	IA	IntAcc	Phys	Perc	IA	IntAcc	Phys	Perc

Failure Mode Gallery

We organize failure cases by scenario type and modality to expose where current world models break down. Each view summarizes the dominant failure modes observed in the benchmark and pairs them with descriptive and predictive examples.

Citation

@article{li2026trimming,
  title={Trimming the Long-Tail of Visual World Modeling Evaluation},
  author={Li, Bingxuan and Hong, Yining and Qian, Cheng and Ha, Hyeonjeong and Liu, Jiateng and Wang, Zhenhailong and Guo, Yue and Li, Yunzhu and Ji, Heng},
  journal={arXiv preprint arXiv:2606.24256},
  year={2026}
}