3DPhysVideo: Consistency-Guided Flow SDE for Video Generation via 3D Scene Reconstruction and Physical Simulation

Hwidong Kim* Yunho Kim* Tae-Kyun Kim
KAIST
*Equal contribution.

We propose 3DPhysVideo, a training-free framework for 3D physics-conditioned video generation, leveraging an off-the-shelf video model. From a single input scene, our method enables users to apply diverse physical controls to a variety of materials.

Abstract

Video generative models have made remarkable progress, yet they often produce visual artifacts that violate physical dynamics. Recent works such as PhysGen3D tackle single-image-to-3D physics through mesh reconstruction and physically based rendering, but challenges remain in modeling fluid dynamics, multi-object interactions, and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf video model in two stages. First, we use it as a novel view synthesizer to reconstruct complete 360-degree 3D scene geometry by guiding the image-to-video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, we use the physically simulated point cloud to guide the same I2V flow model to synthesize the final, high-quality videos. Our Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency with the conditional inputs, allowing us to repurpose the model for both 3D reconstruction and simulation-guided video generation. In diverse experiments, including multi-object and fluid interaction scenes, our method bridges the gap from single images to physically plausible videos while remaining efficient enough to run on a single consumer GPU. It outperforms baselines on GPT-based scores, the VideoPhy benchmark, and human evaluation.
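To make the idea of steering a flow model with a velocity correction concrete, here is a minimal toy sketch in NumPy. It is not the paper's formulation: the function names, the `mask`, `lam`, and `eta` parameters, the time convention (t = 1 is pure noise, t = 0 is data), and the particular form of the bias term are all assumptions made for illustration. A stand-in velocity function replaces the real I2V model, and the "guide" stands in for an encoded rendered point cloud.

```python
import numpy as np

def toy_velocity(x_t, t, target):
    """Hypothetical stand-in for the I2V flow model.

    Under the assumed convention x_t = (1 - t) * x_0 + t * eps, the velocity
    is eps - x_0, so a model that believes the clean sample is `target`
    predicts (x_t - target) / t.
    """
    return (x_t - target) / t

def guided_sde_step(x_t, t, t_next, v_pred, guide, mask,
                    lam=1.0, eta=0.5, rng=None):
    """One stochastic step with an added consistency bias (illustrative only)."""
    rng = np.random.default_rng() if rng is None else rng
    # Velocity that would carry x_t exactly onto the guide at t = 0.
    v_guide = (x_t - guide) / t
    # Assumed bias: pull the prediction toward the guide where the mask is on.
    bias = mask * (v_guide - v_pred)
    v = v_pred + lam * bias
    # Clean-sample and noise estimates implied by the corrected velocity.
    x0_hat = x_t - t * v
    eps_hat = x_t + (1.0 - t) * v
    # Mix fresh noise into the noise estimate; eta = 0 gives the Euler ODE step.
    fresh = rng.standard_normal(x_t.shape)
    eps_mix = np.sqrt(1.0 - eta ** 2) * eps_hat + eta * fresh
    return (1.0 - t_next) * x0_hat + t_next * eps_mix

def sample(target, guide, mask, steps=20, seed=0):
    """Run the toy sampler from pure noise (t = 1) down to t = 0."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(target.shape)
    ts = np.linspace(1.0, 0.0, steps + 1)
    for t, t_next in zip(ts[:-1], ts[1:]):
        v_pred = toy_velocity(x, t, target)
        x = guided_sde_step(x, t, t_next, v_pred, guide, mask, rng=rng)
    return x
```

In this toy setting, masked entries end at the guide and unmasked entries end wherever the stand-in model points, which is the behaviour one would want from guidance that only constrains the regions covered by the rendered point cloud.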

Interactive 3D-Physics Control

Input images for the interactive demo: Snorlax, sand castle, apple.

Our SDE Across Tasks

Panels (each shown alongside its "+ Our SDE" result):
Physics simulation (two examples)
Bounding-box trajectory
Cut-and-drag
Static scene (full 360°)
Dynamic scene (zoom out)

“A set of five metal spheres hangs in a line. Two spheres on the left swing down in a smooth, slow arc … the two spheres on the far right rise upward, maintaining the exact five-sphere arrangement, all unfolding in a slow, deliberate motion.”

“A Snorlax melting into molten lava, surrounded by flames and glowing embers, as fiery lava bursts and flows around it.”

Panels: Naive inference vs. + Our SDE; Initial video vs. + Our SDE.

Comparison with Baselines

Full Pipeline: Single Image to Video Generation

Prompts (each shown with its input image):
The blue ball strikes the dominos. (PhysGen3D: no result)
The macarons and jellies fall to the table.
The sand castle collapses.
The book falls forward.
Apple drops to the right.
The blue ball drops to the green playdough.
The yellow brick falls right.
The ball on the left moves right, hits the ball on the right.
The snorlax plush toy deflates.
The red can moves right, hits the teddy on the right.

Stage 1: Camera-controlled Video Generation

Camera controls shown: static scene, full 360° (four examples); zoom out; translate down; translate up.
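The abstract describes guiding the video model with rendered point clouds. As a rough illustration of what such a render involves, the sketch below places a camera on a circular path around the scene and projects a colored point cloud through a pinhole model with a simple painter's-style depth ordering. The function names, the OpenCV-style axis convention (x right, y down, z forward), and the focal length are assumptions for this example; the project's actual renderer is not specified on this page.

```python
import numpy as np

def orbit_pose(angle_rad, radius, height=0.0):
    """World-to-camera rotation R and translation t for a camera that
    circles the origin at the given angle and looks at the origin."""
    eye = np.array([radius * np.sin(angle_rad), height, radius * np.cos(angle_rad)])
    forward = -eye / np.linalg.norm(eye)          # camera z axis
    world_up = np.array([0.0, 1.0, 0.0])
    right = np.cross(forward, world_up)           # camera x axis
    right /= np.linalg.norm(right)
    down = np.cross(forward, right)               # camera y axis (image rows grow downward)
    R = np.stack([right, down, forward])
    t = -R @ eye
    return R, t

def render_points(points, colors, R, t, focal, width, height):
    """Project points to pixels; nearer points overwrite farther ones.
    Returns the image and a boolean mask of pixels hit by at least one point."""
    cam = points @ R.T + t
    z = cam[:, 2]
    keep = z > 1e-6                               # drop points behind the camera
    cam, z, cols = cam[keep], z[keep], colors[keep]
    u = np.round(focal * cam[:, 0] / z + width / 2).astype(int)
    v = np.round(focal * cam[:, 1] / z + height / 2).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z, cols = u[inside], v[inside], z[inside], cols[inside]
    image = np.zeros((height, width, 3))
    mask = np.zeros((height, width), dtype=bool)
    for i in np.argsort(-z):                      # draw far to near
        image[v[i], u[i]] = cols[i]
        mask[v[i], u[i]] = True
    return image, mask

# Usage: eight evenly spaced views around a full 360° circle.
points = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
colors = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
views = []
for angle in np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False):
    R, t = orbit_pose(angle, radius=2.0)
    views.append(render_points(points, colors, R, t, focal=64.0, width=64, height=64))
```

The returned mask marks which pixels are covered by geometry; pixels outside it are the regions a generative model would have to fill in on its own.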

Stage 2: Motion-conditioned Video Generation

Prompts (each shown with its input image):
The yellow brick falls right.
The book falls forward.
The red can moves right, hits the teddy on the right.
The snorlax plush toy deflates.

BibTeX

@article{kim20263dphysvideo,
  title   = {3DPhysVideo: Consistency-Guided Flow SDE for Video Generation
             via 3D Scene Reconstruction and Physical Simulation},
  author  = {Kim, Hwidong and Kim, Yunho and Kim, Tae-Kyun},
  journal = {arXiv preprint arXiv:2605.16795},
  year    = {2026},
}