Dim Papadopoulos presents Looking twice: Test-Time Scaling and Priors for Vision Models

On 2026-05-12 at 10:00 in G205, Karlovo náměstí 13, Praha 2
In this talk, I will focus on how to get more out of vision and vision-language
models at inference time, rather than through additional training-time scale. I
will present three recent projects from my group, each taking a distinct angle.
The first, Efficient Test-Time Scaling for Small Vision-Language Models (ICLR
2026), shows that careful test-time strategies allow small VLMs to match much larger
models on perception-heavy tasks. The second, a comparison of visual
autoregressive (VAR) and diffusion models under matched inference-time compute
(preprint, 2026), finds that VAR scales more favorably than diffusion at
inference time. The third, HiddenObjects (preprint, 2026), distills spatial
priors from a diffusion model into a fast network for object placement, treating
diffusion as a teacher of plausible layouts rather than as a generator. I will
close with MMLandmarks (CVPR 2026), a new cross-view, instance-level benchmark
for geo-spatial understanding that exposes clear failure modes of current vision
and retrieval models. The discussion will emphasize the trade-offs between
compute, model size, and structural priors.
Responsible person: Petr Pošík