Depth Anything: Unleashing the Power of Unlabeled Data for Depth Estimation

Allan Kouidri
-
3/20/2024
Depth estimation on a person cycling

Depth Anything introduces a groundbreaking methodology for monocular depth estimation, eschewing the need for new technical modules. 

Instead, it prioritizes the expansion of the dataset via a dedicated data engine that automatically annotates an extensive collection of unlabeled images, approximately 62 million in total. This approach substantially broadens data coverage and reduces the model's generalization error. 

As of its publication in January 2024, Depth Anything is recognized as the current state-of-the-art on the NYU-Depth V2 dataset, showcasing its exceptional capability in enhancing depth estimation accuracy and robustness.

What is Depth Anything?

Depth Anything is a groundbreaking methodology for enhancing monocular depth estimation by leveraging a large volume of unlabeled data. It does not rely on novel technical modules but on the vast scale of data and clever strategies to improve the model's ability to generalize across different images and conditions. 

By automatically annotating nearly 62 million unlabeled images, Depth Anything vastly expands the training dataset, enabling the model to learn from a much wider variety of scenes and lighting conditions than previously possible.

How Does Depth Anything Work?

Depth Anything operates by scaling up the dataset through a data engine that collects and automatically annotates a large number of unlabeled images. This process significantly enlarges data coverage, which is crucial for reducing the model's generalization error.

Core methodology

The methodology involves two key strategies: using data augmentation tools to create a more challenging optimization target, and adding auxiliary supervision that compels the model to inherit rich semantic priors from pre-trained encoders. Together, these strategies push the model to actively seek extra visual knowledge and acquire robust representations.
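
To make the first strategy concrete, here is a minimal sketch of the kind of strong perturbations involved (color distortion, Gaussian blur, and CutMix-style mixing), written with torchvision. The exact transforms and parameter values are illustrative assumptions, not the paper's settings.

import torch
import torchvision.transforms as T

# Strong perturbations: heavy color distortion and blur
color_distort = T.Compose([
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4, hue=0.1),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
])

def cutmix(img_a: torch.Tensor, img_b: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    """Paste a random rectangle from img_b onto img_a (CutMix-style mixing)."""
    _, h, w = img_a.shape
    cut_h, cut_w = int(h * ratio), int(w * ratio)
    y = torch.randint(0, h - cut_h + 1, (1,)).item()
    x = torch.randint(0, w - cut_w + 1, (1,)).item()
    mixed = img_a.clone()
    mixed[:, y:y + cut_h, x:x + cut_w] = img_b[:, y:y + cut_h, x:x + cut_w]
    return mixed

# Two unlabeled images (random tensors stand in for real data here)
img_a, img_b = torch.rand(3, 480, 640), torch.rand(3, 480, 640)
strong_view = cutmix(color_distort(img_a), color_distort(img_b))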

Depth Anything pipeline. [1]

  • Solid Line (Labeled Images): The path for labeled images is straightforward: the labeled image goes through the encoder-decoder with feature alignment, and its predictions are supervised by manual labels produced with methods such as LiDAR measurement or Structure from Motion (SfM). This path represents traditional supervised learning, where the depth estimation model learns from explicitly labeled data.

  • Dotted Line (Unlabeled Images): For unlabeled images, the pipeline showcases the innovative use of large-scale unlabeled data to improve the model. Unlabeled images are enhanced with strong perturbations (denoted as 'S' in the diagram) designed to challenge the student model: noise and variations are added to the input images during training to force the model to learn more robust and generalizable features.

        These perturbed images are then processed through the same encoder-decoder structure as the labeled images. However, instead of relying on manual labels, they are supervised with pseudo labels generated by the teacher model.

        This is the semi-supervised aspect of the pipeline: the model also learns from unlabeled data, significantly enriched with semantic information and made more robust by the imposed challenges (a conceptual sketch of how these training signals combine follows this section).

  • Semantic Preservation and Auxiliary Constraint: The pipeline emphasizes preserving semantic information across the process, especially for unlabeled images, by enforcing an auxiliary constraint between the online student model and a frozen encoder.

      This constraint is designed to ensure that despite the absence of explicit labels, the model's predictions for unlabeled images retain semantic coherence with the learned representations, enhancing depth estimation accuracy and reliability.

  • Integration of Labeled and Unlabeled Learning: The diagram reflects the integration of learning from both labeled and unlabeled images, where the model benefits from the explicit information of labeled data and the vast, diverse visual knowledge contained in unlabeled images. This integrated learning approach is key to achieving robust depth estimation across a wide range of scenes and conditions.

The training of the Depth Anything model leverages both labeled and unlabeled images through a combination of traditional supervised learning and innovative semi-supervised techniques. This approach significantly enhances the model's depth estimation capabilities by expanding its exposure to diverse data and challenging learning scenarios.
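
As a rough illustration of how these signals could fit together, here is a conceptual PyTorch-style sketch combining the supervised loss, the pseudo-label loss on strongly perturbed unlabeled images, and the feature-alignment term against a frozen encoder. All modules and helpers (student, teacher, frozen_encoder, depth_loss, strong_perturb, extract_features) are placeholders for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def training_step(student, teacher, frozen_encoder,
                  labeled_img, gt_depth, unlabeled_img,
                  strong_perturb, depth_loss, alpha=0.1):
    # 1) Supervised branch (solid line): learn from manual / sensor labels.
    pred_labeled = student(labeled_img)
    loss_sup = depth_loss(pred_labeled, gt_depth)

    # 2) Semi-supervised branch (dotted line): the frozen teacher produces
    #    pseudo labels on the clean image; the student must match them on a
    #    strongly perturbed version of the same image.
    with torch.no_grad():
        pseudo_depth = teacher(unlabeled_img)
    pred_unlabeled = student(strong_perturb(unlabeled_img))
    loss_unsup = depth_loss(pred_unlabeled, pseudo_depth)

    # 3) Auxiliary semantic constraint: align student features with a frozen
    #    pre-trained encoder via cosine similarity.
    with torch.no_grad():
        target_feat = frozen_encoder(unlabeled_img)
    student_feat = student.extract_features(unlabeled_img)  # assumed feature hook
    loss_feat = 1 - F.cosine_similarity(student_feat, target_feat, dim=-1).mean()

    return loss_sup + loss_unsup + alpha * loss_feat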

Performance

The performance of Depth Anything is compared with other models across various datasets and metrics, demonstrating its superior capability in monocular depth estimation:

Comparison with MiDaS

Depth Anything exhibits stronger zero-shot capability than MiDaS, and this advantage carries over to downstream fine-tuning on the NYUv2 and KITTI datasets.

Metrics comparison of MiDaS vs. Depth Anything [1]
  • AbsRel (Absolute Relative Difference): This metric measures the average absolute relative difference between the predicted and ground-truth depth values across all pixels in the dataset; lower is better. 
  • δ1: This metric evaluates accuracy as the percentage of pixels whose predicted depth falls within a threshold factor (1.25) of the ground-truth depth; higher is better. 

For instance, Depth Anything achieved an Absolute Relative Difference (AbsRel) of 0.056 and a δ1 metric of 0.984 on NYUv2, compared to MiDaS's 0.077 AbsRel and 0.951 δ1, showcasing significant improvements both in accuracy and the ability to predict depth information across different scenes.
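
For reference, here is a minimal sketch of how these two metrics are commonly computed from per-pixel predicted and ground-truth depth arrays. The 1.25 threshold for δ1 is the standard convention; the masking of invalid pixels is an assumption for illustration.

import numpy as np

def abs_rel(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean of |pred - gt| / gt over valid pixels (lower is better)."""
    valid = gt > 0
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def delta1(pred: np.ndarray, gt: np.ndarray, threshold: float = 1.25) -> float:
    """Fraction of pixels where max(pred/gt, gt/pred) < 1.25 (higher is better)."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float(np.mean(ratio < threshold))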

Comparison of Depth Anything vs. MiDaS [2]

Application

Applications of monocular depth estimation (MDE) models like Depth Anything span across various domains, significantly benefiting fields that rely on understanding the spatial configuration of scenes from single images. Here are some key applications:

  • ControlNet image diffusion: This application leverages depth estimation to guide the diffusion process when generating detailed and contextually accurate images. By understanding the spatial layout and depth of a scene, these models can produce images that not only look visually appealing but also maintain realistic perspective and depth.

      This is particularly useful in fields like graphic design, where artists and designers can create more lifelike scenes and visuals for various media, including video games, movies, and virtual reality experiences. 

  • Robotics: In robotics, accurate depth estimation is crucial for navigation, object manipulation, and interaction with the environment. Robots can use depth information to avoid obstacles, plan paths, and perform tasks in complex environments.

  • Autonomous Driving: For autonomous vehicles, understanding the depth of objects and the environment is vital for safe navigation. Depth estimation assists in detecting obstacles, estimating distances, and planning safe paths through traffic.

  • Virtual Reality (VR): In VR, depth estimation can enhance immersive experiences by providing more accurate spatial representations of virtual environments, improving interaction and realism.

These applications demonstrate the broad impact of advances in MDE, like those achieved by Depth Anything, in enhancing machine perception and interaction with the physical world.

Easily run Depth Anything for Depth Estimation 

Setup

With the Ikomia API, you can effortlessly extract a depth map from your image with Depth Anything in just a few lines of code.

To get started, you need to install the API in a virtual environment [3].


pip install ikomia

Run Depth Anything with a few lines of code

You can also directly run the notebook we have prepared.


from ikomia.dataprocess.workflow import Workflow
from ikomia.utils.displayIO import display

# Init your workflow
wf = Workflow()

# Add algorithm
algo = wf.add_task(name="infer_depth_anything", auto_connect=True)

algo.set_parameters({
    'model_name': 'LiheYoung/depth-anything-base-hf',
    'cuda': 'True'
})

# Run directly on your image
wf.run_on(url="https://images.pexels.com/photos/19748906/pexels-photo-19748906.jpeg?cs=srgb&dl=pexels-david-su%C3%A1rez-19748906.jpg&fm=jpg&w=1920&h=2880")

# Display the results
display(algo.get_input(0).get_image())
display(algo.get_output(0).get_image())

List of parameters:

  • model_name (str) - default 'LiheYoung/depth-anything-base-hf': Name of the ViT pre-trained model.

             - 'LiheYoung/depth-anything-small-hf' (24.8M parameters)

             - 'LiheYoung/depth-anything-base-hf' (97.5M parameters)

             - 'LiheYoung/depth-anything-large-hf' (335.3M parameters)

  • cuda (bool): If True, CUDA-based inference (GPU). If False, run on CPU.
Depth Anything depth estimation on a photo of a man cycling
Original image source. [4]
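
If you also want to save the depth map to disk, you can retrieve it from the same algorithm output used for display above. The OpenCV save and 8-bit rescaling below are illustrative additions on top of the Ikomia snippet, under the assumption that get_image() returns a NumPy array.

import cv2
import numpy as np

# Retrieve the depth map produced by the workflow above
depth_map = algo.get_output(0).get_image()

# If the map is floating point, rescale it to 8-bit before saving
# (illustrative assumption; adjust to your output format)
if depth_map.dtype != np.uint8:
    depth_map = cv2.normalize(depth_map, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)

cv2.imwrite("depth_map.png", depth_map)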

Resources

  • Browse the Ikomia HUB showcasing a variety of algorithms, complete with easy-to-access code snippets that simplify experimentation and evaluation.
  • Find comprehensive instructions for maximizing the use of the API in the Ikomia documentation.
  • Test Ikomia STUDIO, which complements the ecosystem by offering a more visual and intuitive approach to image processing, featuring a user-friendly interface that mirrors the API's capabilities.

References

[1] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data

[2] https://depth-anything.github.io/

[3] How to create a virtual environment in Python

[4] https://www.pexels.com/photo/man-riding-a-bike-on-a-championship-19748906/
