NVIDIA Open-Sources ViPE: 3D Video Understanding AI Tool
NVIDIA has officially released ViPE (Video Pose Engine), a powerful new open-source AI tool that extracts 3D information from raw video footage.
The launch is being hailed as a major step forward in computer vision, spatial intelligence, and AI-powered world modeling.
What is ViPE?
ViPE, short for Video Pose Engine for 3D Geometric Perception, was developed by multiple NVIDIA research teams, including the Spatial Intelligence Lab, Dynamic Vision Lab, NVIDIA Isaac, and NVIDIA Research.
With ViPE, researchers and developers can process everything from casual smartphone clips to cinematic footage, dashcam videos, and even 360-degree panoramic recordings.
The system automatically generates accurate camera motion data and depth maps, offering precise 3D insights that were once extremely difficult to capture.
Unlike older approaches, ViPE doesn’t need pre-calibrated equipment or controlled environments; it works directly on raw footage captured “in the wild.”
Key Features and Capabilities
Efficient Processing: ViPE can process video at 3–5 frames per second on a single GPU, making it practical for large-scale annotation.
Multi-Camera Support: It works with perspective, fisheye, and 360-degree cameras, automatically adjusting for each type.
Dynamic Scene Handling: Using GroundingDINO and the Segment Anything Model (SAM), ViPE removes moving objects like cars or people, focusing only on the static environment for accurate motion tracking.
Complex Situations: It performs well even in selfie videos, cinematic shots, and busy traffic footage, where older tools often fail.
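The multi-camera support above hinges on using the right camera model to turn pixels into viewing rays. As an illustration only (not ViPE’s actual implementation), the sketch below contrasts a pinhole (perspective) camera with a 360-degree equirectangular panorama; the coordinate conventions chosen here (x right, y down, z forward) are an assumption:

```python
import numpy as np

def perspective_ray(u, v, fx, fy, cx, cy):
    """Unit viewing ray through pixel (u, v) for a pinhole camera
    with focal lengths (fx, fy) and principal point (cx, cy)."""
    d = np.array([(u - cx) / fx, (v - cy) / fy, 1.0])
    return d / np.linalg.norm(d)

def equirect_ray(u, v, width, height):
    """Unit viewing ray for a pixel in a 360-degree equirectangular
    panorama: u maps to longitude, v maps to latitude."""
    lon = (u / width - 0.5) * 2.0 * np.pi    # longitude in [-pi, pi]
    lat = (0.5 - v / height) * np.pi         # latitude in [-pi/2, pi/2]
    return np.array([
        np.cos(lat) * np.sin(lon),  # x: right
        -np.sin(lat),               # y: down (image convention)
        np.cos(lat) * np.cos(lon),  # z: forward
    ])
```

Note how the panorama needs no focal length: every pixel corresponds directly to a direction on the sphere, which is why a system must detect the camera type and switch models automatically.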
Technical Innovation
ViPE combines the best of classical geometry and modern deep learning. It uses a keyframe-based Bundle Adjustment framework enhanced by three tools:
- Dense optical flow networks for tracking motion.
- Sparse feature tracking for pinpoint accuracy.
- Monocular depth estimation models for realistic scale.
This hybrid system delivers what NVIDIA calls “true, real-world metric scale”, meaning the AI doesn’t just guess depth—it measures distances consistent with physical reality.
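One way to appreciate what metric scale means in practice: a monocular depth network typically predicts depth only up to an unknown scale factor, which must be resolved against real-world measurements. The sketch below shows a common, generic alignment trick (a robust median-ratio fit against sparse metric points); it is a simplified illustration, not ViPE’s actual scale-recovery method:

```python
import numpy as np

def align_depth_to_metric(relative_depth, sparse_metric, mask):
    """Scale an up-to-scale depth map so it agrees with sparse metric
    measurements (e.g. triangulated feature points).

    relative_depth: HxW predicted depth, correct only up to scale
    sparse_metric:  HxW array with metric depths where known
    mask:           HxW bool array, True where sparse_metric is valid
    """
    # Median of per-pixel ratios is robust to outlier measurements.
    ratios = sparse_metric[mask] / relative_depth[mask]
    scale = np.median(ratios)
    return relative_depth * scale, scale
```

After alignment, every pixel in the depth map is expressed in real-world units, which is what downstream robotics and AR systems require.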
Benchmark Performance
According to NVIDIA, ViPE clearly outperforms older uncalibrated pose estimation methods:
- 18% improvement on the TUM dataset for indoor dynamic scenes.
- 50% better results on the KITTI dataset for outdoor driving videos.
- Strong accuracy on SINTEL and ETH3D benchmarks.
Unlike many systems that produce inconsistent or unusable scale data, ViPE provides reliable metric measurements, which is critical for robotics, self-driving cars, and augmented reality.
Massive Dataset Creation
To demonstrate ViPE’s potential, NVIDIA also released a huge dataset called Wild-SDG-1M, containing about 96 million annotated frames.
This dataset includes:
- 100,000 internet videos (15.7M frames)
- 1 million AI-generated videos (78M frames)
- 2,000 panoramic videos with special annotations
Each frame comes with camera poses, dense depth maps, and intrinsic parameters, making it one of the largest resources ever created for 3D computer vision.
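To make the per-frame annotations concrete, here is a minimal sketch of how such a record could be represented and used. The field names and layout are hypothetical (the actual Wild-SDG-1M schema may differ); the projection math itself is standard pinhole geometry:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameAnnotation:
    """Hypothetical per-frame record: pose + depth + intrinsics."""
    pose: np.ndarray        # 4x4 camera-to-world transform
    depth: np.ndarray       # HxW metric depth map
    intrinsics: np.ndarray  # 3x3 K matrix (fx, fy, cx, cy)

    def project(self, point_world):
        """Project a 3D world point into pixel coordinates."""
        T = np.linalg.inv(self.pose)              # world-to-camera
        p_cam = T[:3, :3] @ point_world + T[:3, 3]
        p_img = self.intrinsics @ p_cam
        return p_img[:2] / p_img[2]               # perspective divide
```

Having pose, depth, and intrinsics together per frame is what lets such a dataset serve as supervision for 3D tasks: any pixel can be lifted to a 3D point and re-projected into neighboring frames.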
Availability and Access
ViPE is now completely open-source and available on GitHub, with datasets hosted on Hugging Face.
Researchers can download, modify, and use it freely.
Setup is simple: users can install it in about five minutes via a conda environment.
Importantly, the tool runs locally, keeping data private and avoiding reliance on cloud services.
News Gist
On September 15, 2025, NVIDIA released ViPE (Video Pose Engine), an open-source AI tool that extracts 3D geometry from videos.
Capable of handling smartphone, dashcam, cinematic, and 360° footage, ViPE delivers accurate depth maps and motion data, advancing robotics, AR/VR, and autonomous systems.
FAQs
Q1: What is NVIDIA ViPE?
ViPE (Video Pose Engine) is an open-source AI tool that extracts 3D geometry, depth maps, and camera motion from raw videos.
Q2: When was ViPE released?
NVIDIA released ViPE on September 15, 2025.
Q3: What makes ViPE different from older tools?
It combines classical geometry with deep learning, works with multiple camera types, removes moving objects, and delivers real-world metric scale accuracy.
Q4: What datasets were released with ViPE?
NVIDIA released Wild-SDG-1M, containing about 96 million annotated frames, including internet, AI-generated, and panoramic videos.
Q5: Where can researchers access ViPE?
The code is available on GitHub, and datasets are hosted on Hugging Face.