LeviTor: 3D Trajectory Oriented Image-to-Video Synthesis


1State Key Laboratory for Novel Software Technology, Nanjing University,
2Ant Group,  3Zhejiang University,  4The Hong Kong University of Science and Technology
corresponding author

Abstract

The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. However, existing methods that perform dragging in 2D space face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, namely depth, so that users can assign a relative depth to each point on the trajectory. This new interaction paradigm not only inherits the convenience of 2D dragging but also enables trajectory control in 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis that abstracts object masks into a few cluster points. These points, together with depth and instance information, are fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating object movements while producing photo-realistic videos from static images.
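The control-signal construction described above — abstracting an object mask into a few cluster points that carry depth and instance labels — can be sketched as follows. This is an illustrative sketch only: the function name, parameters, and the use of a plain K-means clustering are our assumptions, not the paper's actual implementation.

```python
import numpy as np

def mask_to_control_points(mask, depth, instance_id, k=8, iters=10, seed=0):
    """Abstract a binary object mask into k cluster points.

    Illustrative sketch: a simple K-means over mask pixel coordinates,
    with a relative depth value and an instance id attached to each
    resulting point. All names and defaults are hypothetical.
    """
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(np.float64)
    rng = np.random.default_rng(seed)
    # Initialize cluster centers from randomly chosen mask pixels.
    centers = pts[rng.choice(len(pts), size=k, replace=False)]
    for _ in range(iters):
        # Assign each mask pixel to its nearest center.
        dist = np.linalg.norm(pts[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its assigned pixels.
        for j in range(k):
            members = pts[labels == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    # Attach the local depth value and the instance id to each point.
    cx = centers[:, 0].round().astype(int).clip(0, mask.shape[1] - 1)
    cy = centers[:, 1].round().astype(int).clip(0, mask.shape[0] - 1)
    return [(float(x), float(y), float(depth[yy, xx]), instance_id)
            for (x, y), xx, yy in zip(centers, cx, cy)]
```

For example, a 20×20 square mask with a constant depth map yields k points of the form (x, y, depth, instance_id), all lying inside the mask region; per-frame lists of such points would then form the trajectory control signal.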

Note: Please refresh the webpage if the GIFs appear to be out of sync.

🎵 We recommend watching the video with sound on 🎵

 


Showcases

Controlled Occlusion Generation with the Same User-Interactive Trajectory

[GIF grid: six examples, each pairing "Start Image & User Input" with "Generation Results"]

In the example of bell swinging, the 2D trajectory shows the bell swinging to the right first and then to the left. By assigning different depth values, two distinct swinging trajectories are achieved. The top bell first leans to the back-right and then to the front-left, while the bottom bell first leans to the front-right and then to the back-left.

Better Control for Forward and Backward Object Movements in Relation to the Lens

[GIF grid: three examples, each pairing "Start Image & User Input" with "Generation Results"]

Implementation of Complex Motions like Orbiting

[GIF grid: three examples, each pairing "Start Image & User Input" with "Generation Results"]

 


Comparisons

Controlled Occlusion Generation with the Same User-Interactive Trajectory

[GIF comparisons: "Start Image & User Input" and "Generation Results" for Ours vs. DragAnything, and for Ours vs. DragNUWA]

Better Control for Forward and Backward Object Movements in Relation to the Lens

[GIF comparisons, two examples: "Start Image & User Input (Ours)" with "Generation Results" for Ours, DragAnything, and DragNUWA]

Implementation of Complex Motions like Orbiting

[GIF comparisons, two examples: "Start Image & User Input (Ours)" with "Generation Results" for Ours, DragAnything, and DragNUWA]

 


Ablations

Ablation on Depth and Instance Information

[GIF grid: "Start Image & User Input" with "Generation Results" for Ours, w/o Instance, and w/o Depth]

Ablation on the Number of Inference Control Points

[GIF grid: three examples comparing "Generation Results with Default Points" and "Generation Results with Dense Points"]

Comparison with Single-Point Controlled Video Synthesis

[GIF grid: three examples comparing "Generation Results with Default Points" and "Generation Results with Single-Point"]