The intuitive nature of drag-based interaction has led to its growing adoption for controlling object trajectories in image-to-video synthesis. Still, existing methods that perform dragging in 2D space usually face ambiguity when handling out-of-plane movements. In this work, we augment the interaction with a new dimension, the depth dimension, so that users can assign a relative depth to each point on the trajectory. That way, our new interaction paradigm not only inherits the convenience of 2D dragging but also facilitates trajectory control in 3D space, broadening the scope of creativity. We propose a pioneering method for 3D trajectory control in image-to-video synthesis by abstracting object masks into a few cluster points. These points, accompanied by depth and instance information, are then fed into a video diffusion model as the control signal. Extensive experiments validate the effectiveness of our approach, dubbed LeviTor, in precisely manipulating object movements while producing photo-realistic videos from static images.
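
To make the idea of abstracting an object mask into a few cluster points more concrete, here is a minimal sketch. It assumes plain K-means over the mask's pixel coordinates, which the text above does not specify, and the function name `mask_to_control_points`, its parameters, and the `(x, y, depth, instance_id)` tuple layout are all hypothetical. It is an illustration under those assumptions, not LeviTor's actual implementation.

```python
import random
from typing import List, Sequence, Tuple

# Hypothetical layout of one control point: (x, y, relative depth, instance id).
ControlPoint = Tuple[float, float, float, int]


def mask_to_control_points(
    mask: Sequence[Sequence[int]],
    depth: Sequence[Sequence[float]],
    instance_id: int,
    k: int = 4,
    iters: int = 20,
    seed: int = 0,
) -> List[ControlPoint]:
    """Summarize a binary object mask as at most k clustered control points.

    Assumption: plain K-means on pixel coordinates stands in for whatever
    clustering the method actually uses.
    """
    # Collect the (x, y) coordinates of every pixel inside the mask.
    pixels = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        return []
    k = min(k, len(pixels))

    # Initialize the centers from randomly chosen mask pixels.
    rng = random.Random(seed)
    centers = [(float(x), float(y)) for x, y in rng.sample(pixels, k)]

    for _ in range(iters):
        # Accumulate coordinate sums and counts per nearest center.
        sums = [[0.0, 0.0, 0] for _ in range(k)]
        for x, y in pixels:
            j = min(
                range(k),
                key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2,
            )
            sums[j][0] += x
            sums[j][1] += y
            sums[j][2] += 1
        # Move each center to the mean of its pixels; keep it if it has none.
        centers = [
            (sx / n, sy / n) if n else centers[j]
            for j, (sx, sy, n) in enumerate(sums)
        ]

    # Attach the depth sampled at the nearest pixel, plus the instance id.
    points: List[ControlPoint] = []
    for cx, cy in centers:
        row = min(max(round(cy), 0), len(depth) - 1)
        col = min(max(round(cx), 0), len(depth[0]) - 1)
        points.append((cx, cy, depth[row][col], instance_id))
    return points
```

Calling `mask_to_control_points(mask, depth, instance_id=1)` on a mask and a same-sized depth map returns a short list of points that could serve as a compact control signal for one object. The default `k=4` is an arbitrary choice for illustration.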
	
  Controlled Occlusion Generation with the Same User-Interactive Trajectory
[Video grid: two rows of three examples each. Every example is shown as a pair: Start Image & User Input, then Generation Results.]
In the bell-swinging example, the 2D trajectory shows the bell swinging first to the right and then to the left. Assigning different depth values to this same trajectory yields two distinct swings: the top bell first leans to the back-right and then to the front-left, while the bottom bell first leans to the front-right and then to the back-left.
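
The bell example can be sketched as one shared 2D path lifted with two different depth profiles. The coordinates, the depth values, the helper name `lift_trajectory`, and the convention that a larger relative depth means farther from the camera are all assumptions made for illustration; the page does not state them.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]


def lift_trajectory(path_2d: Sequence[Point2D], rel_depths: Sequence[float]) -> List[Point3D]:
    """Pair each 2D trajectory point with a user-assigned relative depth."""
    if len(path_2d) != len(rel_depths):
        raise ValueError("each trajectory point needs exactly one depth value")
    return [(x, y, d) for (x, y), d in zip(path_2d, rel_depths)]


# Hypothetical 2D drag: the bell swings right first, then left.
path = [(100, 200), (130, 195), (100, 200), (70, 195)]

# Assumed convention: larger relative depth = farther from the camera.
top_bell = lift_trajectory(path, [0.5, 0.7, 0.5, 0.3])     # back-right, then front-left
bottom_bell = lift_trajectory(path, [0.5, 0.3, 0.5, 0.7])  # front-right, then back-left
```

Both trajectories project to the same 2D drag, so only the depth channel distinguishes the two swings.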
Better Control for Forward and Backward Object Movements in Relation to the Lens
[Video grid: one row of three examples. Every example is shown as a pair: Start Image & User Input, then Generation Results.]
Implementation of Complex Motions like Orbiting
[Video grid: one row of three examples. Every example is shown as a pair: Start Image & User Input, then Generation Results.]
Controlled Occlusion Generation with the Same User-Interactive Trajectory
[Video grid: four rows, alternating comparisons against DragAnything and DragNUWA. Each row shows Start Image & User Input (Ours) and Generation Results (Ours), followed by Start Image & User Input and Generation Results for the compared method.]
Better Control for Forward and Backward Object Movements in Relation to the Lens
[Video grid: two examples. Each is shown as Start Image & User Input (Ours), Generation Results (Ours), Generation Results (DragAnything), and Generation Results (DragNUWA).]
Implementation of Complex Motions like Orbiting
[Video grid: two examples. Each is shown as Start Image & User Input (Ours), Generation Results (Ours), Generation Results (DragAnything), and Generation Results (DragNUWA).]
Ablation on Depth and Instance Information
[Video grid: one example, shown as Start Image & User Input, Generation Results (Ours), Generation Results (w/o Instance), and Generation Results (w/o Depth).]
Ablation on the Number of Inference Control Points
[Video grid: three examples. Each is shown as Start Image & User Input, Generation Results with Default Points, and Generation Results with Dense Points.]
Comparison with Single-Point Controlled Video Synthesis
[Video grid: three examples. Each is shown as Start Image & User Input, Generation Results with Default Points, and Generation Results with Single-Point.]