MotionCrafter: One-Shot Motion Customization of Diffusion Models

1MAIS, Institute of Automation, Chinese Academy of Sciences; 2School of AI, UCAS; 3Institute of Computing Technology, Chinese Academy of Sciences; 4Kuaishou Technology

MotionCrafter can generate new videos with the same motion as a given reference video.

Abstract

The essence of a video lies in its dynamic motions, including character actions, object movements, and camera movements. While text-to-video generative diffusion models have recently advanced in creating diverse content, controlling specific motions through text prompts remains a significant challenge. A primary issue is the coupling of appearance and motion, which often leads to overfitting on appearance. To tackle this challenge, we introduce MotionCrafter, a novel one-shot, instance-guided motion customization method. MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model, while the spatial module is independently adjusted for character or style control. To enhance the disentanglement of motion and appearance, we propose a dual-branch motion disentanglement approach, comprising a motion disentanglement loss and an appearance prior enhancement strategy. During training, a frozen base model provides appearance normalization, effectively separating appearance from motion and thereby preserving diversity. Comprehensive quantitative and qualitative experiments, along with user preference tests, demonstrate that MotionCrafter can successfully integrate dynamic motions while preserving the coherence and quality of the base model and its wide range of appearance generation capabilities.

MotionCrafter

To decompose the appearance and motion of generated videos, we propose a parallel spatial-temporal architecture. It leverages two separate paths to learn the appearance and motion information of a video, corresponding to the spatial and temporal modules in the backbone of a text-to-video generation model. To achieve better disentanglement, we further design a dual-branch motion disentanglement scheme based on an information bottleneck, incorporating a frozen branch of the base model that serves as an appearance prior. During training, the framework takes a reference video and enhanced textual conditioning as inputs and fine-tunes the trainable branch. During inference, the framework takes user-provided textual conditioning as input and, using only the fine-tuned branch, generates results that reproduce the motion of the reference video.
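To make the dual-branch idea more concrete, the following PyTorch-style sketch illustrates one possible training step under simplifying assumptions of our own: a toy block with parallel spatial and temporal convolutions stands in for the backbone, a frozen copy of it plays the role of the appearance prior, and a channel-statistics penalty serves as a stand-in for a motion disentanglement loss. The names `SpatioTemporalBlock`, `build_branches`, and `training_step`, as well as the loss weighting, are hypothetical and do not reflect the released implementation.

```python
# Minimal sketch of dual-branch fine-tuning with an appearance prior.
# All names and loss choices are illustrative, not the authors' code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatioTemporalBlock(nn.Module):
    """Toy stand-in for one backbone block with parallel spatial/temporal paths."""

    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return x + self.temporal(F.silu(self.spatial(x)))


def build_branches(channels: int = 8):
    """Create a trainable branch and a frozen copy acting as the appearance prior."""
    trainable = SpatioTemporalBlock(channels)
    frozen = copy.deepcopy(trainable)
    for p in frozen.parameters():
        p.requires_grad_(False)
    # Fine-tune only the temporal path; spatial (appearance) weights stay fixed.
    for p in trainable.spatial.parameters():
        p.requires_grad_(False)
    return trainable, frozen


def training_step(trainable, frozen, noisy_latent, noise_target):
    """One step combining a denoising loss and a disentanglement term."""
    pred = trainable(noisy_latent)
    # Standard diffusion-style denoising loss on the reference video latent.
    denoise_loss = F.mse_loss(pred, noise_target)
    with torch.no_grad():
        prior = frozen(noisy_latent)
    # Disentanglement term: keep per-frame appearance statistics (channel-wise
    # mean/std over space) close to those of the frozen appearance prior.
    def stats(t):
        return t.mean(dim=(-1, -2)), t.std(dim=(-1, -2))
    mu_p, sd_p = stats(pred)
    mu_f, sd_f = stats(prior)
    disentangle_loss = F.mse_loss(mu_p, mu_f) + F.mse_loss(sd_p, sd_f)
    return denoise_loss + 0.1 * disentangle_loss  # weight is an arbitrary placeholder


if __name__ == "__main__":
    trainable, frozen = build_branches()
    opt = torch.optim.AdamW([p for p in trainable.parameters() if p.requires_grad], lr=1e-4)
    latent = torch.randn(1, 8, 16, 32, 32)   # (B, C, frames, H, W) toy latent
    target = torch.randn_like(latent)
    loss = training_step(trainable, frozen, latent, target)
    loss.backward()
    opt.step()
    print(f"loss: {loss.item():.4f}")
```

In this sketch, the frozen branch supplies appearance statistics for normalization while only the temporal parameters of the trainable branch are updated, which is the general division of labor described above; the actual losses and architecture in the paper differ in detail.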


The overall pipeline of MotionCrafter.

Video

BibTeX

@article{zhang2023motioncrafter,
  title={MotionCrafter: One-Shot Motion Customization of Diffusion Models},
  author={Zhang, Yuxin and Tang, Fan and Huang, Nisha and Huang, Haibin and Ma, Chongyang and Dong, Weiming and Xu, Changsheng},
  journal={arXiv preprint arXiv:2312.05288},
  year={2023}
}