Efficient Multi-task Uncertainties for Joint Semantic Segmentation and Monocular Depth Estimation

Machine Vision Metrology (MVM)
Institute of Photogrammetry and Remote Sensing (IPF)
Karlsruhe Institute of Technology (KIT)
DAGM German Conference on Pattern Recognition (GCPR, 2024)
Best Paper Award Honorable Mention


Abstract

Quantifying predictive uncertainty has emerged as a possible solution to common challenges of deep neural networks, such as overconfidence, lack of explainability, and lack of robustness, albeit one that is often computationally expensive. Many real-world applications are multi-modal in nature and hence benefit from multi-task learning. In autonomous driving or robotics, for example, the joint solution of semantic segmentation and monocular depth estimation has proven to be valuable. To this end, we introduce EMUFormer, a novel student-teacher distillation approach for efficient multi-task uncertainties in the context of joint semantic segmentation and monocular depth estimation. By leveraging the predictive uncertainties of the teacher, EMUFormer achieves new state-of-the-art results on Cityscapes and NYUv2 and additionally estimates high-quality predictive uncertainties for both tasks that are comparable or even superior to those of a Deep Ensemble despite being an order of magnitude more efficient.

Demo Videos

Methodology

EMUFormer Method Overview

A schematic overview of EMUFormer. In addition to the regular cross-entropy (CE) loss for the semantic segmentation (SS) task and the Gaussian Negative Log-Likelihood (GNLL) loss for monocular depth estimation (MDE), EMUFormer utilizes two additional losses that distill the predictive uncertainties of the teacher into the student model.
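
To make the composition of the training objective concrete, here is a minimal PyTorch sketch of how the four loss terms could be combined. The specific distillation terms shown (a KL divergence toward the teacher ensemble's mean softmax and an L1 regression onto the ensemble's depth variance), as well as the loss weights, are illustrative assumptions, not the paper's exact formulations.

import torch.nn.functional as F

def emuformer_loss(
    student_logits,   # (B, C, H, W) semantic segmentation logits
    student_mean,     # (B, 1, H, W) predicted depth means
    student_var,      # (B, 1, H, W) predicted depth variances (positive)
    labels,           # (B, H, W) ground-truth class indices
    depth_gt,         # (B, 1, H, W) ground-truth depth
    teacher_probs,    # (B, C, H, W) mean softmax of the teacher ensemble
    teacher_var,      # (B, 1, H, W) depth variance of the teacher ensemble
    w_ss=1.0, w_mde=1.0, w_ss_kd=1.0, w_mde_kd=1.0,  # assumed weights
):
    # Regular cross-entropy for the segmentation task.
    loss_ce = F.cross_entropy(student_logits, labels)

    # Gaussian NLL for depth: jointly fits the mean and the variance.
    loss_gnll = F.gaussian_nll_loss(student_mean, depth_gt, student_var)

    # Assumed SS distillation term: pull the student's predictive
    # distribution toward the ensemble's mean softmax via KL divergence.
    log_p_student = F.log_softmax(student_logits, dim=1)
    loss_ss_kd = F.kl_div(log_p_student, teacher_probs, reduction="batchmean")

    # Assumed MDE distillation term: regress the student's variance
    # onto the ensemble's depth variance.
    loss_mde_kd = F.l1_loss(student_var, teacher_var)

    return (w_ss * loss_ce + w_mde * loss_gnll
            + w_ss_kd * loss_ss_kd + w_mde_kd * loss_mde_kd)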

Comparison against State of the Art

EMUFormer-B5 surpasses the previous state of the art in joint SS and MDE on both Cityscapes [39] and NYUv2 [245]. For instance, on NYUv2 [245], it achieves a 1.4% higher mIoU and a 0.007 lower RMSE than MTFormer [287], which also adopts a modern ViT-based architecture. In contrast to our work, however, MTFormer relies on cross-task attention and a complex self-supervised pre-training pipeline. Moreover, EMUFormer yields high-quality uncertainty estimates without any additional computational overhead during inference.
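
For reference, a minimal sketch of the two reported metrics: mean intersection-over-union (mIoU) for segmentation and root-mean-square error (RMSE) for depth. The class count and ignore label are dataset-specific (e.g., 19 evaluation classes and ignore index 255 on Cityscapes).

import torch

def miou(pred, target, num_classes, ignore_index=255):
    # Mean intersection-over-union over all classes present in the data.
    valid = target != ignore_index
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c) & valid).sum().float()
        union = (((pred == c) | (target == c)) & valid).sum().float()
        if union > 0:
            ious.append(inter / union)
    return torch.stack(ious).mean()

def rmse(pred_depth, gt_depth):
    # Root-mean-square error between predicted and ground-truth depth.
    return torch.sqrt(torch.mean((pred_depth - gt_depth) ** 2))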

Qualitative Results (In-Domain)

Qualitative Results (Out-of-Domain)

Conclusion

EMUFormer employs student-teacher distillation to achieve state-of-the-art results in joint semantic segmentation and monocular depth estimation on Cityscapes [6] and NYUv2 [54]. Simultaneously, it estimates well-calibrated predictive uncertainties for both tasks. This is achieved without introducing any additional computational overhead during inference, making EMUFormer suitable for time-critical applications. EMUFormer even surpasses the performance of its DE teacher in certain cases, despite the latter having ten times the parameters and approximately 30 times the inference time. Most interestingly, EMUFormer performs particularly well on the depth estimation task in comparison to the teacher. This success can be primarily attributed to the use of the Gaussian Negative Log-Likelihood loss (cf. Sect. 5.3), which is commonly employed to implicitly learn corresponding variances in addition to the predictive means. In the case of EMUFormer, however, the teacher model already provides high-quality variances through distillation, allowing for a more accurate approximation of the predictive means and their associated uncertainties. Overall, these findings align nicely with previous work [28, 32] on leveraging uncertainties during training, making it an interesting avenue for future work.
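
To make the role of the GNLL loss concrete, below is a minimal sketch (variable names are illustrative; PyTorch ships an equivalent built-in, torch.nn.GaussianNLLLoss). Minimizing the loss with respect to both outputs drives the mean toward the target, while the variance settles at the expected squared error of the mean, which is how the variances are learned implicitly.

import torch

def gaussian_nll(mean, var, target, eps=1e-6):
    # Per-pixel Gaussian negative log-likelihood, up to an additive constant.
    # The variance is learned implicitly: at each pixel, the optimal `var`
    # equals the expected squared error of `mean` at that pixel.
    var = var.clamp_min(eps)  # keep the variance strictly positive
    return 0.5 * (torch.log(var) + (target - mean) ** 2 / var).mean()

# Illustrative usage with random tensors standing in for network outputs:
mean = torch.rand(4, 1, 64, 64)      # predicted depth means
var = torch.rand(4, 1, 64, 64)       # predicted depth variances
depth_gt = torch.rand(4, 1, 64, 64)  # ground-truth depth
loss = gaussian_nll(mean, var, depth_gt)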

BibTeX

@inproceedings{landgraf2024efficient,
  title={Efficient multi-task uncertainties for joint semantic segmentation and monocular depth estimation},
  author={Landgraf, Steven and Hillemann, Markus and Kapler, Theodor and Ulrich, Markus},
  booktitle={DAGM German Conference on Pattern Recognition},
  pages={348--364},
  year={2024},
  organization={Springer}
}