Researchers from the University of California, Berkeley, and the University of Hong Kong have developed a new approach to metric depth estimation, a crucial technology for robotics, autonomous vehicles, and energy-sector applications such as renewable infrastructure inspection. The team, led by Baorui Ma and Jiahui Yang, presents its findings in a paper titled “MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources.”
Metric depth estimation involves determining the distance of each pixel in an image from the camera, enabling 3D scene understanding. However, this task has been challenging due to noise from different sensors, camera-specific biases, and metric ambiguity in diverse 3D data. The researchers introduce MetricAnything, a scalable pretraining framework that learns metric depth from noisy and diverse 3D sources without relying on manually engineered prompts or task-specific architectures.
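To make the task concrete: once per-pixel metric depth and the camera intrinsics are known, every pixel can be lifted into a 3D point via the standard pinhole model, which is what enables the 3D scene understanding described above. The sketch below is a minimal NumPy illustration; the intrinsics values are placeholders, not taken from the paper.

```python
import numpy as np

def backproject_depth(depth, fx, fy, cx, cy):
    """Lift an (H, W) metric depth map in meters to an (H*W, 3) point
    cloud using the standard pinhole camera model."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # per-pixel coordinates
    z = depth
    x = (u - cx) / fx * z  # X = (u - cx) * Z / fx
    y = (v - cy) / fy * z  # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)

# Example: a synthetic 4x4 depth map, every pixel 2 m from the camera.
depth = np.full((4, 4), 2.0)
points = backproject_depth(depth, fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(points.shape)  # (16, 3)
```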
The key innovation in MetricAnything is the Sparse Metric Prompt, which randomly masks depth maps to serve as a universal interface. This approach decouples spatial reasoning from sensor and camera biases, allowing the model to learn from a wide range of data sources. The researchers trained their model using about 20 million image-depth pairs, spanning reconstructed, captured, and rendered 3D data across 10,000 camera models. This extensive training data enabled them to demonstrate a clear scaling trend in metric depth estimation, a first in the field.
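The paper's exact masking recipe is not reproduced here, but the core idea can be sketched in a few lines: randomly retain a small fraction of depth values and drop the rest, so the model sees only sparse metric anchors regardless of which sensor produced the original map. In the NumPy sketch below, the keep ratio and uniform random sampling are illustrative assumptions, not the authors' settings.

```python
import numpy as np

def sparse_metric_prompt(depth, keep_ratio=0.01, seed=None):
    """Randomly mask a dense depth map, keeping only a small fraction of
    pixels as a sparse metric prompt. Masked pixels are zeroed out and a
    boolean validity mask is returned alongside."""
    rng = np.random.default_rng(seed)
    keep = rng.random(depth.shape) < keep_ratio  # keep ~keep_ratio of pixels
    prompt = np.where(keep, depth, 0.0)
    return prompt, keep

# Example: keep roughly 1% of a dense 480x640 metric depth map.
dense = np.random.default_rng(0).uniform(0.5, 10.0, size=(480, 640))
prompt, valid = sparse_metric_prompt(dense, keep_ratio=0.01, seed=42)
print(round(valid.mean(), 4))  # close to 0.01
```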
The pretrained model excels at various prompt-driven tasks, such as depth completion, super-resolution, and radar-camera fusion. Additionally, a distilled prompt-free student model achieved state-of-the-art results in monocular depth estimation, camera intrinsics recovery, single- and multi-view metric 3D reconstruction, and vision-language-action (VLA) planning. The researchers also showed that using MetricAnything's pretrained Vision Transformer (ViT) as a visual encoder significantly boosts the spatial intelligence of Multimodal Large Language Models.
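Tasks such as depth completion and radar-camera fusion fit this prompt interface naturally: sparse sensor returns are projected into the image plane to form a sparse depth map, which can then serve as the metric prompt. The sketch below shows that projection step in plain NumPy; the intrinsics and point distribution are illustrative, and it does not depict MetricAnything's actual API.

```python
import numpy as np

def points_to_prompt(points_cam, fx, fy, cx, cy, h, w):
    """Project sparse 3D sensor returns (N, 3), given in the camera
    frame, into an (H, W) sparse depth map usable as a metric prompt."""
    prompt = np.zeros((h, w), dtype=np.float32)
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    front = z > 0                      # keep points in front of the camera
    u = np.round(fx * x[front] / z[front] + cx).astype(int)
    v = np.round(fy * y[front] / z[front] + cy).astype(int)
    ok = (u >= 0) & (u < w) & (v >= 0) & (v < h)  # inside the image bounds
    prompt[v[ok], u[ok]] = z[front][ok]
    return prompt

# Example: 100 random returns 1-30 m ahead of a 640x480 camera.
pts = np.random.default_rng(1).uniform([-5, -2, 1], [5, 2, 30], size=(100, 3))
prompt = points_to_prompt(pts, fx=500, fy=500, cx=320, cy=240, h=480, w=640)
print(int((prompt > 0).sum()))  # pixels carrying a metric anchor
```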
For the energy sector, these advances could improve robotics and autonomous systems for inspecting and maintaining renewable energy infrastructure, such as wind turbines and solar panels. Enhanced 3D scene understanding could also benefit energy storage and management systems, as well as smart grid technologies. The researchers have open-sourced MetricAnything to support community research and further advances in the field. The paper appears in the Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).
This article is based on research available on arXiv.

