Aybora Koksal presents Towards Better Visual Understanding on Remote Sensing Imagery with Visual Language Models
On 2026-04-29 14:00:00 at G205, Karlovo náměstí 13, Praha 2
Compact vision–language models (VLMs) offer a practical path to deploying AI
on satellite imagery, but achieving strong visual understanding at small scales
remains challenging. This work presents three complementary directions sharing a
unifying methodology: compact 2B-scale backbones, chain-of-thought reasoning
supervision where beneficial, and alignment through verifiable rewards via
GRPO/RLVR. SAMChat tackles secluded SAM-site detection with a strengthened
evaluation protocol, achieving high recall and very low false-alarm rates after
CoT+GRPO alignment. TinyRS demonstrates that explicit reasoning helps on
spatially demanding tasks, especially grounding, while concise answering
remains preferable for exact-match VQA under strict scoring. The newest
contribution, Few-Shot RLVR for Satellite Imagery, shows that meaningful gains
can be obtained with as few as 1–128 reward-checkable examples, offering a
data-efficient route when dense captioning is unavailable. Together, these
results support the claim that compact remote-sensing VLMs, when aligned with
explicit reasoning and verifiable rewards, can be both competitive and
deployable.
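
For readers unfamiliar with alignment through verifiable rewards, the sketch below illustrates the general idea on a grounding task. It is not taken from the presented work: the function names, the IoU threshold of 0.5, and the binary reward are illustrative assumptions. It shows a reward that can be checked mechanically against a reference box, followed by the group-relative advantage normalization commonly associated with GRPO (the reward of each sampled answer minus the group mean, divided by the group standard deviation).

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def grounding_reward(pred_box, gold_box, threshold=0.5):
    """Hypothetical verifiable reward: 1.0 if the predicted box overlaps
    the reference box by at least `threshold` IoU, else 0.0."""
    return 1.0 if iou(pred_box, gold_box) >= threshold else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of sampled answers.
    Assumption: population standard deviation; implementations differ."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    gold = (0, 0, 10, 10)
    # Four sampled answers to the same prompt (made-up boxes).
    samples = [(0, 0, 10, 10), (5, 5, 15, 15), (0, 0, 10, 8), (20, 20, 30, 30)]
    rewards = [grounding_reward(s, gold) for s in samples]
    print(rewards)
    print(group_relative_advantages(rewards))
```

Because the reward is computed from the reference annotation alone, each training example only needs a checkable answer rather than a dense caption, which is the property the few-shot setting described above relies on.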