Zero-shot customized video editing with diffusion feature transfer
2025
Customized video editing aims at substituting the object in a given source video with a target object from reference images (Fig. 1). Existing approaches often rely on fine-tuning pre-trained models to learn the appearance of the objects in the reference images, as well as the temporal information from the source video. These methods are, however, not scalable, as fine-tuning is required for each source video and each object to be customized, incurring computational overhead. More importantly, such individual customization often leads to overfitting to a few given reference images. In this paper, by leveraging the pre-trained Stable Diffusion model, we propose FreeMix, a zero-shot customized video editing approach. From careful empirical analysis of the diffusion features, we observe that the motion information of the moving object in the source video is captured by the low-frequency, high-level diffusion features, while the high-frequency, low-level diffusion features primarily encode the object's appearance in the reference images. By exploiting this observation, we achieve effective motion transfer from the source video and appearance transfer from the reference images to synthesize the output video via simple feature transfer in the diffusion model. To enhance the temporal consistency of the synthesized video, we apply optimal transport to the low-level diffusion features of consecutive source video frames, establishing feature correspondences to guide video generation. We conduct comprehensive experiments, demonstrating that our approach surpasses both state-of-the-art customized video editing methods that require fine-tuning and general text-based video editing methods.
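To make the two ideas in the abstract more concrete, the sketch below illustrates, under stated assumptions, (1) mixing the source frame's low-frequency (motion-related) feature component with the reference image's high-frequency (appearance-related) component via an FFT-based split, and (2) building soft correspondences between consecutive-frame features with entropic optimal transport (Sinkhorn iterations). This is not the paper's implementation: the feature shapes, the literal FFT cutoff, and the `mix_features` / `sinkhorn_correspondence` helpers are assumptions for illustration only; the paper's actual transfer may instead operate on features from different network levels and a different transport formulation.

```python
# Illustrative sketch only (not the FreeMix implementation). Assumes intermediate
# diffusion features are available as plain (C, H, W) tensors.
import torch
import torch.nn.functional as F


def frequency_split(feat: torch.Tensor, cutoff: int):
    """Split a (C, H, W) feature map into low- and high-frequency parts via a 2D FFT."""
    spec = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1))
    _, H, W = feat.shape
    yy, xx = torch.meshgrid(
        torch.arange(H) - H // 2, torch.arange(W) - W // 2, indexing="ij"
    )
    low_mask = ((yy.abs() <= cutoff) & (xx.abs() <= cutoff)).to(spec.dtype)
    low = torch.fft.ifft2(torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))).real
    return low, feat - low  # low-frequency part, high-frequency residual


def mix_features(src_feat: torch.Tensor, ref_feat: torch.Tensor, cutoff: int = 4):
    """Keep the source's low-frequency (motion) component and the reference's
    high-frequency (appearance) component. Hypothetical helper for illustration."""
    src_low, _ = frequency_split(src_feat, cutoff)
    _, ref_high = frequency_split(ref_feat, cutoff)
    return src_low + ref_high


def sinkhorn_correspondence(feat_t: torch.Tensor, feat_t1: torch.Tensor,
                            eps: float = 0.05, iters: int = 50):
    """Soft correspondence (transport plan) between spatial locations of two
    consecutive frames' feature maps via entropic optimal transport."""
    C = feat_t.shape[0]
    a = F.normalize(feat_t.reshape(C, -1).t(), dim=-1)   # (H*W, C)
    b = F.normalize(feat_t1.reshape(C, -1).t(), dim=-1)  # (H*W, C)
    cost = 1.0 - a @ b.t()                               # cosine distance
    K = torch.exp(-cost / eps)
    r = torch.full((K.shape[0],), 1.0 / K.shape[0])      # uniform marginals
    c = torch.full((K.shape[1],), 1.0 / K.shape[1])
    v = torch.ones(K.shape[1]) / K.shape[1]
    for _ in range(iters):
        u = r / (K @ v)
        v = c / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)           # transport plan


if __name__ == "__main__":
    src = torch.randn(320, 32, 32)   # assumed feature map of a source video frame
    ref = torch.randn(320, 32, 32)   # assumed feature map of a reference image
    nxt = torch.randn(320, 32, 32)   # assumed feature map of the next source frame
    mixed = mix_features(src, ref)
    plan = sinkhorn_correspondence(src, nxt)
    print(mixed.shape, plan.shape)   # torch.Size([320, 32, 32]) torch.Size([1024, 1024])
```

In this reading, each row of the transport plan gives a soft matching of a spatial location at frame t to locations at frame t+1, which could then be used to warp or blend features across frames for temporal consistency.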