Aybora Koksal presents Towards Better Visual Understanding on Remote Sensing Imagery with Visual Language Models

On 2026-04-29 at 14:00 in G205, Karlovo náměstí 13, Praha 2
Compact vision–language models (VLMs) offer a practical path to deploying AI
on satellite imagery, but achieving strong visual understanding at small scales
remains challenging. This work presents three complementary directions sharing a
unifying methodology: compact 2B-scale backbones, chain-of-thought reasoning
supervision where beneficial, and alignment through verifiable rewards via
GRPO/RLVR. SAMChat tackles secluded SAM-site detection with a strengthened
evaluation protocol, achieving high recall and very low false-alarm rates after
CoT+GRPO alignment. TinyRS demonstrates that explicit reasoning helps on
spatially demanding tasks, especially grounding, while concise answering
remains preferable for exact-match VQA under strict scoring. The newest
contribution, Few-Shot RLVR for Satellite Imagery, shows that meaningful gains
can be obtained with as few as 1–128 reward-checkable examples, offering a
data-efficient route when dense captioning is unavailable. Together, these
results support the claim that compact remote-sensing VLMs, when aligned with
explicit reasoning and verifiable rewards, can be both competitive and
deployable.
Responsible person: Petr Pošík