Ivana Benova presents "How to Probe Vision-Language Models for Deeper Understanding"

On 2025-04-08 at 11:00 in G205, Karlovo náměstí 13, Praha 2
Recent advances in vision-language models (VLMs) have significantly improved
performance on downstream tasks. However, a critical challenge remains:
assessing their ability to understand fine-grained linguistic constructs such
as verbs, objects, spatial relations, counting, and contextual
reasoning. In this talk, I will present insights from our research on probing
VLMs beyond traditional image-text matching. We introduce novel evaluation
methodologies, including post-retrieval analysis and guided masking, to
investigate models’ robustness in verb comprehension, numerical reasoning, and
the interplay between lexical and world knowledge in multimodal models. Our
findings reveal significant gaps in VLMs' ability to ground complex linguistic
phenomena, highlighting the need for future improvements.
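
To give a rough sense of what a guided-masking probe can look like, the sketch below masks a target word (e.g., the verb) in a caption and checks whether a masked-language-modeling VLM can recover it from the image. This is an illustrative assumption, not the exact protocol from the talk; the ViLT checkpoint is a publicly available model, while the helper function, top-k criterion, and example file are hypothetical.

```python
# Minimal guided-masking probe (illustrative sketch, not the speaker's method):
# mask a target word in the caption and test whether the model recovers it.
import torch
from PIL import Image
from transformers import ViltProcessor, ViltForMaskedLM

processor = ViltProcessor.from_pretrained("dandelin/vilt-b32-mlm")
model = ViltForMaskedLM.from_pretrained("dandelin/vilt-b32-mlm")

def probe_masked_word(image: Image.Image, caption: str, target: str, k: int = 5) -> bool:
    """Mask `target` in `caption`; return True if the model ranks it in its top-k guesses.
    Assumes `target` maps to a single tokenizer token (common for frequent words)."""
    masked = caption.replace(target, processor.tokenizer.mask_token, 1)
    inputs = processor(image, masked, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    # Locate the [MASK] position in the encoded sequence.
    mask_pos = (inputs.input_ids[0] == processor.tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    top_ids = logits[0, mask_pos].topk(k, dim=-1).indices[0]
    guesses = [processor.tokenizer.decode([int(i)]).strip() for i in top_ids]
    return target in guesses

# Hypothetical usage: does the model ground the verb, not just the objects?
# image = Image.open("dog_running.jpg")
# print(probe_masked_word(image, "a dog running on the beach", "running"))
```

Comparing recovery rates across word classes (verbs vs. nouns, for instance) is one way such a probe can expose where a model's grounding breaks down.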
Responsible person: Petr Pošík