A Comparative Study on Multi-task Uncertainty Quantification in Semantic Segmentation and Monocular Depth Estimation

Machine Vision Metrology (MVM)
Institute of Photogrammetry and Remote Sensing (IPF)
Karlsruhe Institute of Technology (KIT)
De Gruyter Journal, tm - Technisches Messen (2025)


Abstract

Deep neural networks excel in perception tasks such as semantic segmentation and monocular depth estimation, making them indispensable in safety-critical applications like autonomous driving and industrial inspection. However, they often suffer from overconfidence and poor explainability, especially for out-of-domain data. While uncertainty quantification has emerged as a promising solution to these challenges, multi-task settings have yet to be explored. To shed light on this, we evaluate Monte Carlo Dropout (MCD), Deep Sub-Ensembles (DSEs), and Deep Ensembles (DEs) for joint semantic segmentation and monocular depth estimation. We reveal that Deep Ensembles stand out as the preferred choice, particularly in out-of-domain scenarios, and show the potential benefit of multi-task learning for uncertainty quality compared to solving both tasks separately. Additionally, we highlight the impact of employing different uncertainty thresholds to classify pixels as certain or uncertain, with the median uncertainty emerging as a robust default.
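
For illustration, here is a minimal PyTorch-style sketch of how Monte Carlo Dropout predictions could be sampled for a joint model. The model interface (a forward pass returning segmentation logits and a depth map) and the sample count are assumptions for this sketch, not the paper's exact implementation.

import torch

def mc_dropout_predict(model, image, num_samples=10):
    # Monte Carlo Dropout: keep the network in eval mode but re-enable
    # its dropout layers, then aggregate several stochastic forward passes.
    model.eval()
    for module in model.modules():
        if isinstance(module, torch.nn.Dropout):
            module.train()  # dropout stays active at inference time
    seg_probs, depths = [], []
    with torch.no_grad():
        for _ in range(num_samples):
            seg_logits, depth = model(image)  # assumed joint (seg, depth) output
            seg_probs.append(seg_logits.softmax(dim=1))
            depths.append(depth)
    # Shapes: [S, B, C, H, W] for class probabilities, [S, B, 1, H, W] for depth.
    return torch.stack(seg_probs), torch.stack(depths)

Deep Sub-Ensembles and Deep Ensembles replace the stochastic forward passes with predictions from several partially or fully independently trained networks, but the aggregation over S samples is analogous.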

Methodology

A schematic overview of the SegFormer [284] and DepthFormer architectures. Both models share the same hierarchical Transformer-based encoder, which generates high-resolution coarse features and low-resolution fine-grained features, and a lightweight all-MLP decoder. They differ only in the number of output channels and in the output activations.

A schematic overview of the SegDepthFormer architecture. The model combines the SegFormer [284] architecture with a lightweight all-MLP depth decoder.
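
To make the shared-encoder design concrete, the following is a hypothetical PyTorch sketch of a SegDepthFormer-style model with one encoder and two lightweight task heads. The class names, the 1x1-convolution heads, and the ReLU depth activation are illustrative assumptions; the multi-scale feature fusion of the all-MLP decoder is abstracted into the encoder for brevity.

import torch
import torch.nn as nn

class MLPHead(nn.Module):
    # Lightweight per-task decoder head in the spirit of the all-MLP design;
    # a 1x1 convolution acts as a per-pixel linear layer.
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, features):
        return self.proj(features)

class SegDepthModel(nn.Module):
    # Shared hierarchical encoder with two task-specific decoders.
    def __init__(self, encoder, embed_dim, num_classes):
        super().__init__()
        self.encoder = encoder  # e.g., a SegFormer-style backbone
        self.seg_head = MLPHead(embed_dim, num_classes)
        self.depth_head = MLPHead(embed_dim, 1)

    def forward(self, x):
        features = self.encoder(x)  # fused multi-scale features (abstracted)
        seg_logits = self.seg_head(features)  # one channel per class
        depth = torch.relu(self.depth_head(features))  # assumed non-negative depth
        return seg_logits, depth

Sharing the encoder means the two tasks differ only in their lightweight heads, which keeps the multi-task model close in size to a single-task one.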

Quantitative Results

Quantitative comparison on Cityscapes [39] between SegFormer, DepthFormer, and SegDepthFormer, each paired with MCD, DSEs, and DEs. See Section 5.3.2 for a concise description of how the uncertainties are calculated. Best results are marked in bold.

Quantitative comparison on NYUv2 [245] between SegFormer, DepthFormer, and SegDepthFormer, each paired with MCD, DSEs, and DEs. See Section 5.3.2 for a concise description of how the uncertainties are calculated. Best results are marked in bold.
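
Section 5.3.2 is not reproduced here; a common convention, sketched below under that assumption, is to use the predictive entropy of the mean class distribution as the segmentation uncertainty and the sample standard deviation as the depth uncertainty.

import torch

def ensemble_uncertainties(seg_probs, depths, eps=1e-10):
    # seg_probs: [S, B, C, H, W] softmax outputs of S members/samples
    # depths:    [S, B, 1, H, W] depth predictions of the same S members
    mean_probs = seg_probs.mean(dim=0)  # [B, C, H, W]
    # Segmentation uncertainty: predictive entropy of the mean distribution.
    seg_entropy = -(mean_probs * (mean_probs + eps).log()).sum(dim=1)  # [B, H, W]
    # Depth uncertainty: standard deviation across the S predictions.
    depth_std = depths.std(dim=0).squeeze(1)  # [B, H, W]
    return mean_probs, seg_entropy, depths.mean(dim=0), depth_std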

Conclusion

By comparing uncertainty quantification methods for joint semantic segmentation and monocular depth estimation, we find that Deep Ensembles offer the best predictive performance and uncertainty quality, albeit at a higher computational cost. Deep Sub-Ensembles provide an efficient alternative with minimal trade-offs in predictive performance and uncertainty quality. Additionally, we reveal that multi-task learning can enhance the uncertainty quality of semantic segmentation compared to solving both tasks separately. Furthermore, we show that while the choice of the uncertainty threshold significantly impacts the metrics, its influence remains independent of the underlying model or approach, and the median uncertainty of an image proves to be a suitable default threshold, yielding high values for p(accurate|certain) and p(uncertain|inaccurate). Lastly, we find that Deep Ensembles exhibit robustness in out-of-domain scenarios, offering superior predictive performance and uncertainty quality.
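
As an illustration of the threshold-based evaluation, the following sketch computes p(accurate|certain) and p(uncertain|inaccurate) for a single image, defaulting to the median uncertainty as the threshold. The function name and the tie-breaking convention (a pixel counts as certain when its uncertainty is at or below the threshold) are assumptions of this sketch.

import torch

def conditional_metrics(accurate, uncertainty, threshold=None):
    # accurate:    [H, W] boolean mask of correctly predicted pixels
    # uncertainty: [H, W] per-pixel uncertainty values
    # threshold:   scalar; defaults to the median uncertainty of the image
    if threshold is None:
        threshold = uncertainty.median()
    certain = uncertainty <= threshold  # assumed tie-breaking convention
    n_ac = (accurate & certain).sum().float()    # accurate and certain
    n_ic = (~accurate & certain).sum().float()   # inaccurate and certain
    n_iu = (~accurate & ~certain).sum().float()  # inaccurate and uncertain
    p_accurate_given_certain = n_ac / (n_ac + n_ic).clamp(min=1)
    p_uncertain_given_inaccurate = n_iu / (n_ic + n_iu).clamp(min=1)
    return p_accurate_given_certain, p_uncertain_given_inaccurate

High values of both metrics indicate that low uncertainty reliably signals correct predictions and that errors are flagged as uncertain, which is exactly the behavior a safety-critical downstream system needs.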

BibTeX

@article{landgraf2025comparative,
  title={A comparative study on multi-task uncertainty quantification in semantic segmentation and monocular depth estimation},
  author={Landgraf, Steven and Hillemann, Markus and Kapler, Theodor and Ulrich, Markus},
  journal={tm-Technisches Messen},
  number={0},
  year={2025},
  publisher={De Gruyter}
}