Text-to-video models have demonstrated impressive capabilities in producing diverse and captivating video content, marking a notable advance in generative AI. However, these models generally lack fine-grained control over motion patterns, which limits their practical applicability. We introduce MotionFlow, a novel framework for motion transfer in video diffusion models. Our method uses cross-attention maps to accurately capture and manipulate spatial and temporal dynamics, enabling seamless motion transfer across various contexts. Our approach requires no training and operates at test time by leveraging the inherent capabilities of pre-trained video diffusion models. In contrast to traditional approaches, which struggle to maintain consistent motion under comprehensive scene changes, MotionFlow handles such complex transformations through its attention-based mechanism. Our qualitative and quantitative experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility, even under drastic scene alterations.
Our invert-then-generate method operates in two main stages: (1) Inversion, where DDIM inversion is used to extract latent representations and cross-attention maps from the original video, generating target masks that capture the subject's motion and spatial details; (2) Generation, where these masks and a text prompt guide the creation of a new video, aligning with the original video's motion dynamics and spatial layout while adhering to the semantic content of the prompt.
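The two stages above can be illustrated with a minimal sketch. It is not the MotionFlow implementation: the function names `ddim_inversion_step` and `attention_to_mask` are hypothetical, the noise prediction is a stand-in for a real video diffusion model, and the min-max normalization with a fixed threshold is an assumed way of turning a cross-attention map into a binary mask. Only the deterministic DDIM update itself is the standard formula.

```python
import math

def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """Move a latent one deterministic DDIM step toward a noisier level.

    x_t, eps: flat lists of floats (latent values and predicted noise).
    alpha_bar_t, alpha_bar_next: cumulative schedule products in (0, 1).
    """
    out = []
    for x, e in zip(x_t, eps):
        # Estimate the clean latent implied by the current noise prediction.
        x0_pred = (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        # Re-noise that estimate to the next (noisier) level with the same eps.
        out.append(math.sqrt(alpha_bar_next) * x0_pred
                   + math.sqrt(1.0 - alpha_bar_next) * e)
    return out

def attention_to_mask(attn_frame, threshold=0.5):
    """Binarize one frame's cross-attention map (a list of rows).

    Assumption: min-max normalize the map, then keep cells at or above
    `threshold`. The source text does not specify the masking rule.
    """
    values = [v for row in attn_frame for v in row]
    lo, hi = min(values), max(values)
    span = max(hi - lo, 1e-8)  # avoid division by zero on flat maps
    return [[(v - lo) / span >= threshold for v in row] for row in attn_frame]

# Toy data: a 2-frame "video" whose subject token attends to one moving cell.
attn_maps = [
    [[0.1, 0.9, 0.1], [0.1, 0.2, 0.1]],  # frame 0: subject at top centre
    [[0.1, 0.2, 0.1], [0.1, 0.1, 0.8]],  # frame 1: subject at bottom right
]
masks = [attention_to_mask(frame) for frame in attn_maps]

# Toy inversion step with a made-up noise prediction.
latent = [0.5, -0.25]
noise_pred = [0.1, 0.3]
noisier = ddim_inversion_step(latent, noise_pred, alpha_bar_t=0.9, alpha_bar_next=0.7)
```

In this sketch the per-frame masks follow the subject from one cell to another, which is the kind of motion signal the generation stage would then be guided by, alongside the text prompt.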
Source Video
A robot walking across ancient stone ruins
A drone hovering on rocky branches in a tropical rainforest
Source Video
A horse riding on a forest road
A tractor riding on a farm road
Source Video
A penguin swimming in Antarctic waters
A koi swimming in a Japanese garden pond
Source Video
A horse jumping into a river
A panda jumping into a river in a bamboo forest
Source Video
A moose crossing a road in a snowy landscape
A gorilla crossing a road in a rainforest
Source Video
A motorbike rides in a forest
A yacht rides near a coastal forest
Comparison panels: Original, Ours, DMT [1], MotionDirector [2], Motion Inversion [3], VMC [4]
CLIP text similarity versus Motion Fidelity scores for each baseline. Our method exhibits a better balance between these two metrics.
@article{meral2024motionflow,
title={MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models},
author={Meral, Tuna Han Salih and Yesiltepe, Hidir and Dunlop, Connor and Yanardag, Pinar},
journal={arXiv preprint arXiv:2412.05275},
year={2024}
}