Josip Šarić presents What holds back open-vocabulary segmentation?

On 2025-08-26 11:00:00 at G205, Karlovo náměstí 13, Praha 2

Open-vocabulary segmentation leverages vision–language models to handle
concepts beyond the in-domain classes, i.e., those present in the training set
images. Despite early progress, performance has stagnated over the past two
years and remains far behind closed-set, in-domain models.

We introduce oracle components that use ground-truth information to identify
the issues that cause the gap. The analysis reveals two main problems:
vanilla vision–language models struggle with region-level classification,
and mask decoders fail to generate reliable proposals due to conflicting
training and evaluation objectives.
The findings point to concrete directions for future research.

External www: https://arxiv.org/pdf/2508.04211