While recent foundation models have enabled significant breakthroughs in monocular depth estimation, a clear path towards safe and reliable deployment in the real world remains elusive. Metric depth estimation, which involves predicting absolute distances, poses particular challenges, as even the most advanced foundation models remain prone to critical errors. Since quantifying uncertainty has emerged as a promising way to address these limitations and enable trustworthy deployment, we fuse five different uncertainty quantification methods with the state-of-the-art DepthAnythingV2 foundation model. To cover a wide range of metric depth domains, we evaluate their performance on four diverse datasets. Our findings identify fine-tuning with the Gaussian Negative Log-Likelihood (GNLL) loss as a particularly promising approach, offering reliable uncertainty estimates while maintaining predictive performance and computational efficiency on par with the baseline, in both training and inference time. By fusing uncertainty quantification with foundation models in the context of monocular depth estimation, this paper lays a critical foundation for future research aimed at improving not only model performance but also model explainability. Extending this critical synthesis of uncertainty quantification and foundation models to other crucial tasks, such as semantic segmentation and pose estimation, presents exciting opportunities for safer and more reliable machine vision systems.
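As a rough illustration of the GNLL fine-tuning idea, the sketch below implements the per-pixel Gaussian negative log-likelihood, assuming the network predicts a mean depth and a log-variance per pixel. This is a minimal numpy sketch with illustrative names, not code from the paper.

```python
import numpy as np

def gaussian_nll(depth_gt, mu, log_var):
    """Per-pixel Gaussian negative log-likelihood (constants dropped).

    Predicting the log-variance keeps the variance strictly positive
    and the loss numerically stable; the variance head doubles as a
    per-pixel uncertainty estimate at inference time.
    """
    var = np.exp(log_var)
    return 0.5 * (log_var + (depth_gt - mu) ** 2 / var)

# Toy example: with the same small predicted variance, a correct
# pixel is penalized less than a wrong one, so the loss pushes the
# model to enlarge the variance wherever its mean is unreliable.
gt = np.array([2.0, 2.0])
mu = np.array([2.0, 3.0])       # first pixel correct, second off by 1 m
log_var = np.array([-2.0, -2.0])  # confident (small variance) everywhere
loss = gaussian_nll(gt, mu, log_var)
```

The same expression exists as a ready-made loss in common deep learning frameworks (e.g. `torch.nn.GaussianNLLLoss` in PyTorch).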
A schematic overview of how to fuse the five different UQ approaches with the DepthAnythingV2 foundation model.
Overview of metric depth datasets that we used for evaluation.
Qualitative examples for indoor [245], outdoor [39], aerial [197], and robotics [261] scenarios with varying UQ approaches and encoder sizes. Red rectangles highlight areas of interest.
Motivated by the need to bridge the gap between cutting-edge research and the safe deployment of MDE models in real-world applications, we conducted a comprehensive evaluation of multiple UQ methods in conjunction with the state-of-the-art DepthAnythingV2 foundation model. Our evaluation covered five different UQ approaches -- Learned Confidence, Gaussian Negative Log-Likelihood, MC Dropout, Sub-Ensembles, and Test-Time Augmentation -- and was carried out across four diverse datasets covering domains relevant to real-world applications: NYUv2, Cityscapes, UseGeo, and HOPE. Our findings highlight fine-tuning with GNLL as the most promising option, consistently delivering high-quality uncertainty estimates while maintaining depth performance comparable to the baseline. Its computational efficiency, which matches that of the baseline, further underscores its practical suitability for deployment. This study emphasizes the importance and feasibility of integrating UQ into machine vision models, demonstrating that reliable uncertainty estimates need not come at the expense of predictive performance or computational efficiency. By addressing this critical aspect, we aim to inspire future research that prioritizes not only performance but also explainability through uncertainty awareness, fostering the development of safer and more reliable models not only for MDE but also for other real-world tasks, such as semantic segmentation or pose estimation.
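Among the training-free approaches evaluated, Test-Time Augmentation derives uncertainty from the spread of predictions under input transformations. The sketch below illustrates the idea with a single horizontal-flip augmentation and a stand-in `predict` function; the function names and the toy predictor are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def tta_depth(predict, image):
    """Test-time augmentation for monocular depth.

    Runs the model on the original and the horizontally flipped
    input, maps the flipped prediction back to the original frame,
    and returns the per-pixel mean as the depth estimate and the
    per-pixel standard deviation as the uncertainty estimate.

    `predict` is any function mapping an HxW image to an HxW depth
    map (a stand-in for the depth model here).
    """
    preds = [
        predict(image),
        np.fliplr(predict(np.fliplr(image))),  # undo the flip
    ]
    stack = np.stack(preds)
    return stack.mean(axis=0), stack.std(axis=0)

# Toy predictor whose output depends on horizontal pixel position,
# so the flipped pass disagrees with the original and uncertainty
# grows away from the image's vertical symmetry axis.
def toy_predict(img):
    return img + np.arange(img.shape[1])

image = np.zeros((2, 4))
mean_depth, uncertainty = tta_depth(toy_predict, image)
```

More augmentations (e.g. color jitter or small crops) can be appended to the `preds` list in the same way; unlike MC Dropout or Sub-Ensembles, this requires no changes to the trained model, at the cost of one extra forward pass per augmentation.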
@article{landgraf2025critical,
title={A critical synthesis of uncertainty quantification and foundation models in monocular depth estimation},
author={Landgraf, Steven and Qin, Rongjun and Ulrich, Markus},
journal={arXiv preprint arXiv:2501.08188},
year={2025}
}