LIVE

Long-horizon Interactive Video World ModEling

1CUHK-Shenzhen · 2Shenzhen Loop Area Institute · 3Microsoft Research · 4HKU · 5Voyager Research, Didi Chuxing
Team lead · Corresponding author
Teaser Results

Teaser. LIVE achieves bounded error accumulation for stable long-horizon video world modeling. Top: Qualitative comparison with baselines, together with FID curves showing that LIVE maintains stable quality while other methods degrade as rollout length increases. Bottom: Applications in a real-world setting (RealEstate10K) and gaming environments (Minecraft, Unreal Engine).

Existing Paradigms

Autoregressive video world models face a fundamental challenge: errors accumulate as generation proceeds. Prior approaches include Teacher Forcing (TF), which conditions on ground-truth context during training and therefore suffers from a train-inference mismatch; Diffusion Forcing (DF), which injects noise into the context but fails to model real rollout errors; and Self-Forcing (SF), which employs sequence-level distillation but requires a pre-trained teacher model and still suffers from unbounded error accumulation.

Training Paradigms Comparison

Figure 2. Comparison of autoregressive training paradigms. Teacher Forcing (TF) uses ground truth context during training, causing train-inference mismatch. Diffusion Forcing (DF) injects noise but fails to model real rollout errors. Self-Forcing (SF) employs sequence-level distillation with unbounded error accumulation. Our LIVE performs forward rollout then reverse recovery with frame-level diffusion loss, bounding errors through the cycle-consistency objective.

Challenge of Direct Supervision

Figure 3. Challenge: Rollout from GT produces semantically diverse content, making direct supervision infeasible. LIVE addresses this by requiring the model to generate back toward the original GT, enabling valid supervision through the cycle-consistency objective.

Method

LIVE introduces a framework that enforces bounded error accumulation via a cycle-consistency constraint. Specifically, LIVE performs a forward rollout from ground-truth (GT) frames followed by a reverse generation process to reconstruct the initial state, on which the diffusion loss is computed. This formulation explicitly enforces cycle consistency by training the model to map its own imperfect rollouts back to the GT manifold.
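The forward-rollout-then-reverse-recovery idea above can be sketched in a few lines. This is a deliberately minimal illustration, not the authors' implementation: frames are scalars, the "model" is a toy next-frame function, and a mean-squared error stands in for the frame-level diffusion loss; all names here are assumptions.

```python
# Minimal sketch of LIVE's cycle-consistency objective (toy scalars;
# MSE stands in for the frame-level diffusion loss).

def rollout(model, prompt, n_steps):
    """Forward rollout: autoregressively extend the prompt frames."""
    frames = list(prompt)
    for _ in range(n_steps):
        frames.append(model(frames[-1]))  # next frame from latest context
    return frames

def cycle_consistency_loss(model, gt_frames, p):
    """Roll out from p GT prompt frames, reverse the trajectory, and ask
    the model to generate back toward the original GT prompt."""
    T = len(gt_frames)
    # 1) forward rollout from the GT prompt (kept frozen in the paper)
    traj = rollout(model, gt_frames[:p], T - p)
    # 2) reverse the trajectory: the generated frames become the context
    reversed_ctx = traj[::-1][: T - p]
    # 3) recover the prompt frames from the model's own imperfect rollout
    recovered = rollout(model, reversed_ctx, p)[-p:]
    # supervise against the original GT prompt (in reversed temporal order)
    target = gt_frames[:p][::-1]
    return sum((r - t) ** 2 for r, t in zip(recovered, target)) / p

# Toy next-frame "model" that slowly drifts, mimicking rollout error.
model = lambda x: 0.9 * x
loss = cycle_consistency_loss(model, [1.0] * 8, p=2)
```

Because the supervision target is the GT prompt itself, the loss penalizes exactly the drift introduced during the model's own rollout, which is what bounds error accumulation.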

Progressive Training
LIVE Pipeline

Figure 4. LIVE training pipeline. Forward rollout (Left, frozen): Given p prompt frames x_i, the model generates the remaining T-p frames via causal attention. Cycle-consistency objective (Right, trainable): The rollout is reversed and used as context to recover the original prompt frames via a frame-level diffusion loss, employing reverse attention (right mask, shown for p=2). Top: Progressive training curriculum obtained by increasing the rollout ratio. From left to right, as p decreases, more generated frames enter the context, increasing the model's error tolerance while maintaining recoverability through the cycle-consistency objective.
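The progressive curriculum in Figure 4 amounts to a schedule over p, the number of GT prompt frames. A hedged sketch is below; the linear annealing shape, window length T=16, and bounds p_max/p_min are illustrative assumptions, not values from the paper.

```python
# Assumed linear schedule for the progressive curriculum: the number of
# GT prompt frames p shrinks over training, so the rollout ratio (the
# fraction of the context the model generated itself) grows.

def prompt_frames_schedule(step, total_steps, p_max=8, p_min=1):
    """Linearly anneal the prompt length p from p_max down to p_min."""
    frac = min(step / max(total_steps, 1), 1.0)
    p = round(p_max - frac * (p_max - p_min))
    return max(p_min, min(p_max, p))

def rollout_ratio(p, T=16):
    """Fraction of the T-frame window the model must generate itself."""
    return (T - p) / T

schedule = [prompt_frames_schedule(s, 100) for s in (0, 50, 100)]
# → [8, 4, 1]: p decreases, so the rollout ratio rises from 0.5 toward ~0.94
```

As p decreases, more self-generated frames enter the context, so the model is trained on contexts that increasingly resemble its own inference-time rollouts.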

BibTeX

@misc{huang2026livelonghorizoninteractivevideo,
  title={LIVE: Long-horizon Interactive Video World Modeling},
  author={Junchao Huang and Ziyang Ye and Xinting Hu and Tianyu He and Guiyu Zhang and Shaoshuai Shi and Jiang Bian and Li Jiang},
  year={2026},
  eprint={2602.03747},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2602.03747},
}