Video Depth Anything
Sili Chen · Hengkai Guo† · Shengnan Zhu · Feihu Zhang
Zilong Huang · Jiashi Feng · Bingyi Kang†
ByteDance
†Corresponding author
This work presents Video Depth Anything, which is based on Depth Anything V2 and can be applied to arbitrarily long videos without compromising quality, consistency, or generalization ability. Compared with diffusion-based models, it offers faster inference, fewer parameters, and more accurate, more consistent depth.
News
- 2025-03-11: Add full dataset inference and evaluation scripts.
- 2025-02-08: Enable autocast inference. Support grayscale video, NPZ and EXR output formats.
- 2025-01-21: Paper, project page, code, models, and demo are all released.
Release Notes
- 2025-02-08: 🚀🚀🚀 Inference speed and memory usage improvement
The latency and GPU VRAM results are obtained on a single A100 GPU with an input of shape 1 × 32 × 518 × 518.

| Model | Latency FP32 (ms) | Latency FP16 (ms) | GPU VRAM FP32 (GB) | GPU VRAM FP16 (GB) |
|---|---|---|---|---|
| Video-Depth-Anything-V2-Small | 9.1 | 7.5 | 7.3 | 6.8 |
| Video-Depth-Anything-V2-Large | 67 | 14 | 26.7 | 23.6 |
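
To put the figures reported above in perspective, the following sketch computes the FP16 speedup and VRAM saving for each model. The numbers are copied from the release notes; the dictionary layout and the `fp16_gains` helper are illustrative names, not part of the repository.

```python
# Figures copied from the release notes above
# (single A100 GPU, input of shape 1 x 32 x 518 x 518).
BENCHMARKS = {
    "Video-Depth-Anything-V2-Small": {
        "latency_ms": {"fp32": 9.1, "fp16": 7.5},
        "vram_gb": {"fp32": 7.3, "fp16": 6.8},
    },
    "Video-Depth-Anything-V2-Large": {
        "latency_ms": {"fp32": 67.0, "fp16": 14.0},
        "vram_gb": {"fp32": 26.7, "fp16": 23.6},
    },
}


def fp16_gains(entry):
    """Return (latency speedup factor, VRAM saved in GB) of FP16 over FP32."""
    speedup = entry["latency_ms"]["fp32"] / entry["latency_ms"]["fp16"]
    vram_saved = entry["vram_gb"]["fp32"] - entry["vram_gb"]["fp16"]
    return speedup, vram_saved


for name, entry in BENCHMARKS.items():
    speedup, saved = fp16_gains(entry)
    print(f"{name}: {speedup:.2f}x faster, {saved:.1f} GB less VRAM with FP16")
```

On these figures, FP16 is roughly 1.2x faster for the Small model and roughly 4.8x faster for the Large model, so the larger model benefits most from reduced precision.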
Pre-trained Models
We provide two models of varying scales for robust and consistent video depth estimation:
| Model | Params | Checkpoint |
|---|---|---|
| Video-Depth-Anything-V2-Small | 28.4M | Download |
| Video-Depth-Anything-V2-Large | 381.8M | Download |
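
As a rough guide to what the parameter counts mean in practice, the sketch below estimates the weights-only memory footprint of each model. It assumes 4 bytes per parameter in FP32 and 2 bytes in FP16, and it ignores activations, buffers, and any extra data stored in the checkpoint file, so actual download sizes and runtime VRAM will differ. The helper name `weight_size_mb` is illustrative.

```python
# Parameter counts (in millions) copied from the table above.
PARAMS_MILLIONS = {
    "Video-Depth-Anything-V2-Small": 28.4,
    "Video-Depth-Anything-V2-Large": 381.8,
}

# Assumption: standard storage sizes for each floating-point precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_size_mb(params_millions, precision):
    """Weights-only size in MB (1 MB = 10**6 bytes)."""
    # (params_millions * 1e6 params) * bytes / 1e6 bytes-per-MB
    return params_millions * BYTES_PER_PARAM[precision]


for name, params in PARAMS_MILLIONS.items():
    fp32 = weight_size_mb(params, "fp32")
    fp16 = weight_size_mb(params, "fp16")
    print(f"{name}: ~{fp32:.1f} MB in FP32, ~{fp16:.1f} MB in FP16")
```

Under these assumptions the Large model's weights take about 1.5 GB in FP32, which suggests that most of the VRAM used during inference goes to activations rather than weights.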
Usage
Preparation
git clone https://github.com/DepthAnything/Video-Depth-Anything
cd Video-Depth-Anything
pip install -r requirements.txt
Download the checkpoints listed under Pre-trained Models and put them under the checkpoints directory.
bash get_weights.sh
Run inference on a video
python3 run.py --input_video ./assets/example_videos/davis_rollercoaster.mp4 --output_dir ./outputs --encoder vitl
Options:
- `--input_video`: path of the input video.
- `--output_dir`: path to save the output results.
- `--input_size` (optional): By default, we use input size `518` for model inference.
- `--max_res` (optional): By default, we use maximum resolution `1280` for model inference.
- `--encoder` (optional): `vits` for Video-Depth-Anything-V2-Small, `vitl` for Video-Depth-Anything-V2-Large.
- `--max_len` (optional): maximum length of the input video; `-1` means no limit.
- `--target_fps` (optional): target fps of the input video; `-1` means the original fps.
- `--fp32` (optional): Use `fp32` precision for inference. By default, we use `fp16`.
- `--grayscale` (optional): Save the grayscale depth map, without applying a color palette.
- `--save_npz` (optional): Save the depth map in `npz` format.
- `--save_exr` (optional): Save the depth map in `exr` format.
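
When processing many videos, it can help to assemble the `run.py` command programmatically. The following is a minimal sketch that uses only the flags listed above; the helper name `build_run_command` and the example paths are hypothetical and not part of the repository.

```python
import shlex


def build_run_command(input_video, output_dir, encoder, *,
                      input_size=None, max_res=None, max_len=None,
                      target_fps=None, fp32=False, grayscale=False,
                      save_npz=False, save_exr=False):
    """Build the argument list for run.py from the documented options."""
    if encoder not in ("vits", "vitl"):
        raise ValueError("encoder must be 'vits' (Small) or 'vitl' (Large)")

    cmd = ["python3", "run.py",
           "--input_video", str(input_video),
           "--output_dir", str(output_dir),
           "--encoder", encoder]

    # Options that take a value; omitted ones fall back to run.py's defaults.
    valued = (("--input_size", input_size), ("--max_res", max_res),
              ("--max_len", max_len), ("--target_fps", target_fps))
    for flag, value in valued:
        if value is not None:
            cmd += [flag, str(value)]

    # Boolean switches are appended only when enabled.
    switches = (("--fp32", fp32), ("--grayscale", grayscale),
                ("--save_npz", save_npz), ("--save_exr", save_exr))
    for flag, enabled in switches:
        if enabled:
            cmd.append(flag)
    return cmd


cmd = build_run_command(
    "./assets/example_videos/davis_rollercoaster.mp4", "./outputs", "vits",
    max_len=100, save_npz=True,
)
print(shlex.join(cmd))
# To execute it from the repository root:
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the command as a list avoids shell quoting problems with paths that contain spaces; `shlex.join` is used here only to print a copy-pasteable version.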
Citation
If you find this project useful, please consider citing:
@article{video_depth_anything,
title={Video Depth Anything: Consistent Depth Estimation for Super-Long Videos},
author={Chen, Sili and Guo, Hengkai and Zhu, Shengnan and Zhang, Feihu and Huang, Zilong and Feng, Jiashi and Kang, Bingyi},
journal={arXiv:2501.12375},
year={2025}
}
LICENSE
Video-Depth-Anything-Small model is under the Apache-2.0 license. Video-Depth-Anything-Large model is under the CC-BY-NC-4.0 license. For business cooperation, please send an email to Hengkai Guo at guohengkaighk@gmail.com.
