arxiv:2510.20888

Video-As-Prompt: Unified Semantic Control for Video Generation

Published on Oct 23
Submitted by Yuxuan BIAN on Oct 27
#2 Paper of the day
Abstract

Video-As-Prompt (VAP) uses a reference video to guide a frozen Video Diffusion Transformer via a Mixture-of-Transformers expert, achieving state-of-the-art results in semantic-controlled video generation with strong zero-shot generalization.

AI-generated summary

Unified, generalizable semantic control in video generation remains a critical open challenge. Existing methods either introduce artifacts by enforcing inappropriate pixel-wise priors from structure-based controls, or rely on non-generalizable, condition-specific finetuning or task-specific architectures. We introduce Video-As-Prompt (VAP), a new paradigm that reframes this problem as in-context generation. VAP leverages a reference video as a direct semantic prompt, guiding a frozen Video Diffusion Transformer (DiT) via a plug-and-play Mixture-of-Transformers (MoT) expert. This architecture prevents catastrophic forgetting and is guided by a temporally biased position embedding that eliminates spurious mapping priors for robust context retrieval. To power this approach and catalyze future research, we built VAP-Data, the largest dataset for semantic-controlled video generation with over 100K paired videos across 100 semantic conditions. As a single unified model, VAP sets a new state-of-the-art for open-source methods, achieving a 38.7% user preference rate that rivals leading condition-specific commercial models. VAP's strong zero-shot generalization and support for various downstream applications mark a significant advance toward general-purpose, controllable video generation.
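The summary above describes a frozen DiT conditioned by a plug-and-play trainable expert that reads the reference video as an in-context prompt, with a temporally biased position embedding keeping reference and target frames from aligning one-to-one. As a very loose illustration of that idea, here is a minimal PyTorch sketch: all names, the sinusoidal embedding, the zero-initialized gate, and the position offset are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


def temporal_embedding(positions: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal embedding over (possibly biased) temporal positions."""
    half = dim // 2
    freqs = torch.exp(-torch.arange(half) * (torch.log(torch.tensor(10000.0)) / half))
    ang = positions[:, None] * freqs[None, :]          # (T, half)
    return torch.cat([ang.sin(), ang.cos()], dim=-1)   # (T, dim)


class Block(nn.Module):
    """A generic pre-norm transformer block (stand-in for one DiT layer)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class VAPBlockSketch(nn.Module):
    """One layer of the sketched design: a frozen base branch processes the
    target tokens, while a trainable expert branch attends over the
    concatenated [reference ; target] sequence. The expert's output for the
    target positions is added back through a zero-initialized gate, so the
    frozen model's behavior is untouched at the start of training."""
    def __init__(self, dim: int):
        super().__init__()
        self.base = Block(dim)
        for p in self.base.parameters():
            p.requires_grad_(False)                # frozen pretrained weights
        self.expert = Block(dim)                   # trainable expert branch
        self.gate = nn.Parameter(torch.zeros(1))   # zero-init gated residual

    def forward(self, target: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        mixed = self.expert(torch.cat([reference, target], dim=1))
        expert_out = mixed[:, reference.shape[1]:]  # keep target positions only
        return self.base(target) + self.gate * expert_out


# Demo: the reference clip gets temporal positions offset to a disjoint range,
# so no frame of the prompt shares a position with a frame of the target.
dim, B, Tr, Tt = 64, 2, 4, 4
ref = torch.randn(B, Tr, dim)
tgt = torch.randn(B, Tt, dim)
ref_pos = torch.arange(Tr, dtype=torch.float) - Tr  # e.g. [-4, -3, -2, -1]
tgt_pos = torch.arange(Tt, dtype=torch.float)       # [0, 1, 2, 3]
ref = ref + temporal_embedding(ref_pos, dim)
tgt = tgt + temporal_embedding(tgt_pos, dim)
out = VAPBlockSketch(dim)(tgt, ref)
print(out.shape)  # torch.Size([2, 4, 64])
```

Only the expert branch and the gate carry gradients here, which is one plausible reading of how a plug-and-play expert can steer a frozen backbone without catastrophic forgetting; the real MoT design in the paper may differ substantially.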

Community

Paper author and submitter

Text prompts still not giving you the video you really want?
Then stop relying on text — use a video as the prompt!

Introducing Video-As-Prompt — the first unified framework for semantically controlled video generation.

Project Page: https://bytedance.github.io/Video-As-Prompt/
Code: https://github.com/bytedance/Video-As-Prompt
Dataset: https://huggingface.co/datasets/BianYx/VAP-Data
Model: https://huggingface.co/collections/ByteDance/video-as-prompt
Demo Video: https://www.youtube.com/watch?v=S3zpLIMOU4c


Models citing this paper 2

Datasets citing this paper 1

Spaces citing this paper 0


Collections including this paper 9