Semantic segmentation is critical for scene understanding but demands costly pixel-wise annotations, drawing increasing attention to semi-supervised approaches that leverage abundant unlabeled data. While semi-supervised segmentation is often promoted as a path toward scalable, real-world deployment, current evaluation protocols focus exclusively on segmentation accuracy, entirely overlooking reliability and robustness. These qualities, which ensure consistent performance under diverse conditions (robustness) and well-calibrated confidences with meaningful uncertainty estimates (reliability), are essential for safety-critical applications like autonomous driving, where models must handle unpredictable environments and avoid sudden failures at all costs. To address this gap, we introduce the Reliable Segmentation Score (RSS), a novel metric that combines predictive accuracy, calibration, and uncertainty quality measures via a harmonic mean. RSS penalizes deficiencies in any of its components, providing an intuitive way to judge segmentation models holistically. Comprehensive evaluations of UniMatchV2 against its predecessor and a supervised baseline show that semi-supervised methods often trade reliability for accuracy. While out-of-domain evaluations demonstrate UniMatchV2's robustness, they further expose persistent reliability shortcomings. We advocate a shift in evaluation protocols toward more holistic metrics like RSS to better align semi-supervised learning research with real-world deployment needs.
In this work, we addressed a critical blind spot in semi-supervised semantic segmentation: the lack of attention to model reliability and robustness. To fill this gap, we introduced the Reliable Segmentation Score (RSS), which holistically integrates accuracy, calibration, and uncertainty quality into a single metric through the harmonic mean, penalizing poor performance in any component. Through comprehensive evaluations in both in-domain and out-of-domain scenarios, we revealed that the current state-of-the-art UniMatchV2 achieves superior predictive performance and robustness but is often less calibrated and produces less reliable uncertainty estimates than its supervised counterpart. These findings raise legitimate questions about whether incremental gains in segmentation accuracy reflect meaningful progress toward reliable and robust deployment. We hope that our investigation and the proposed RSS metric serve as a stepping stone toward more principled evaluation protocols that better align research objectives with real-world requirements, focusing not only on performance but also on reliability and robustness.
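To illustrate the core idea behind RSS, here is a minimal sketch of a harmonic-mean combination of accuracy, calibration, and uncertainty quality. The specific component choices (mIoU, one minus the expected calibration error, and an AUROC-style uncertainty-quality score, all in [0, 1]) are illustrative assumptions, not the paper's exact definitions; the sketch only demonstrates the harmonic mean's property of penalizing weakness in any single component.

```python
# Hypothetical sketch of a Reliable-Segmentation-Score-style metric.
# The harmonic mean collapses toward zero whenever any component is poor,
# so no single strength (e.g., high accuracy) can mask a weakness.

def harmonic_mean(values):
    """Harmonic mean of positive values; 0.0 if any component is non-positive."""
    if any(v <= 0.0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)

def reliable_segmentation_score(miou, ece, uncertainty_quality):
    """Illustrative RSS-like score (NOT the paper's exact formula).

    miou                -- segmentation accuracy in [0, 1]
    ece                 -- expected calibration error in [0, 1] (lower is better),
                           so it enters as (1 - ece)
    uncertainty_quality -- e.g., AUROC of uncertainty vs. errors, in [0, 1]
    """
    return harmonic_mean([miou, 1.0 - ece, uncertainty_quality])

# A more accurate but poorly calibrated model scores lower overall
# than a balanced one, reflecting the trade-off discussed above:
balanced = reliable_segmentation_score(0.75, ece=0.05, uncertainty_quality=0.80)
skewed = reliable_segmentation_score(0.85, ece=0.40, uncertainty_quality=0.80)
```

In this toy comparison the balanced model outscores the more accurate but miscalibrated one, which is the behavior the harmonic mean is chosen for: an arithmetic mean would largely average the miscalibration away.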
@article{landgraf2025rethinking,
  title={Rethinking Semi-supervised Segmentation Beyond Accuracy: Reliability and Robustness},
  author={Landgraf, Steven and Hillemann, Markus and Ulrich, Markus},
  journal={arXiv preprint arXiv:2506.05917},
  year={2025}
}