Trimming the Long-Tail of
Visual World Modeling
Evaluation
1 University of Illinois Urbana-Champaign 2 Stanford University 3 Columbia University
Physical interactions follow a long-tailed distribution: a set of common and regular physical interactions dominates human experience and visual data, whereas a broad spectrum of rare and irregular interactions demands reasoning over fundamental object attributes. Although recent world models (e.g., image and video generation models) achieve impressive realism on existing benchmarks, they primarily focus on simulating common, in-distribution physical interactions.
To answer this question, we introduce TAILOR, a benchmark that challenges world models with irregular physical interactions. We evaluate scenarios under two complementary settings—Predictive generation and Descriptive generation—and reveal a pronounced long-tail generalization gap. Our results suggest that current world models exhibit limited understanding of physical principles and struggle with attribute-level generalization under distribution shift.
TAILOR Benchmark
TAILOR challenges world models to simulate long-tail scenarios that require reasoning about object attributes.
Regular Scenarios
Reflects common tool-task pairs. Tests whether models can reproduce highly frequent interactions observed in training data.
Unconventional Scenarios
Replaces canonical tools with attribute-compatible substitutes. Tests affordance generalization beyond surface-level associations.
Impossible Scenarios
Introduces attribute-violating tools. Probes constraint awareness and whether models recognize when an interaction should fail.
Two Evaluation Settings
Predictive Generation
We provide a prompt describing the initial setup only (e.g., "A person tries to open a wine bottle with a screwdriver"). The model must infer the physical outcome.
- Tests: Internalized physical priors and causal "common sense."
- Insight: Does the model "know" what happens next without being told?
Descriptive Generation
The prompt explicitly specifies the outcome (e.g., "...the screwdriver fails to remove the cork"). This removes the need for physical "guessing."
- Tests: Fine-grained instruction following and attribute binding.
- Insight: Can the model override its training biases when told to do so?
Quantitative Results
Automatic and human evaluation results across TAILOR's three scenarios.
Prompt Setting
Image Generation Models
| Model | Regular | Unconventional | Impossible | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IA | IntAcc | Phys | Perc | IA | IntAcc | Phys | Perc | IA | IntAcc | Phys | Perc | |
Video Generation Models
| Model | Regular | Unconventional | Impossible | |||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| IA | IntAcc | Phys | Perc | IA | IntAcc | Phys | Perc | IA | IntAcc | Phys | Perc | |
Failure Mode Gallery
We organize failure cases by scenario type and modality to expose where current world models break down. Each view summarizes the dominant failure modes observed in the benchmark and pairs them with descriptive and predictive examples.
Scenario
Modality
Prompt
Citation
@article{li2026trimming,
title={Trimming the Long-Tail of Visual World Modeling Evaluation},
author={Li, Bingxuan and Hong, Yining and Qian, Cheng and Ha, Hyeonjeong and Liu, Jiateng and Wang, Zhenhailong and Guo,
Yue and Li, Yunzhu and Ji, Heng},
journal={arXiv preprint arXiv:2606.24256},
year={2026}
}