Sajid Javed presents Marine Image Analysis, Datasets, and Applications (note: date change)
On 2025-07-31 at 11:00 in room G205, Karlovo náměstí 13, Praha 2
The preservation of aquatic biodiversity is critical in mitigating the
effects of climate change. Aquatic scene understanding plays a pivotal
role in aiding marine scientists in their decision-making processes. In
this talk, I will present AquaticCLIP, a novel contrastive
language-image pre-training model tailored for aquatic scene
understanding. AquaticCLIP introduces a new unsupervised learning framework
that aligns images and texts in aquatic environments, enabling tasks such
as segmentation, classification, detection, and object counting. By
leveraging our large-scale underwater image-text paired dataset without
the need for ground-truth annotations, our model enriches existing
vision-language models in the aquatic domain. For this purpose, we
construct a dataset of 2 million underwater image-text pairs from
heterogeneous sources, including YouTube, Netflix, and NatGeo. To
fine-tune AquaticCLIP, we propose a prompt-guided vision encoder that
progressively aggregates patch features via learnable prompts, while a
vision-guided mechanism enhances the language encoder by incorporating
visual context. The model is optimized through a contrastive pre-
training loss to align visual and textual modalities. AquaticCLIP
achieves notable performance improvements in zero-shot settings across
multiple underwater computer vision tasks, outperforming existing methods
in both robustness and interpretability. Our model sets a new benchmark
for vision-language applications in underwater environments. The code and
dataset for AquaticCLIP are publicly available.
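
For orientation before the talk, the sketch below illustrates, in plain PyTorch, the two mechanisms named in the abstract: aggregation of vision patch features via learnable prompts, and a symmetric image-text contrastive loss. This is my own minimal illustration, not the AquaticCLIP implementation; all class, function, and parameter names are assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F


class PromptGuidedPooling(nn.Module):
    # Learnable prompt tokens attend over ViT patch features and are
    # averaged into a single image embedding (illustrative only).
    def __init__(self, dim: int, num_prompts: int = 8, num_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(0.02 * torch.randn(num_prompts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        queries = self.prompts.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        pooled, _ = self.attn(queries, patch_tokens, patch_tokens)
        return pooled.mean(dim=1)  # (batch, dim)


def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     logit_scale: float = 100.0) -> torch.Tensor:
    # Symmetric InfoNCE over a batch of matched image-text pairs,
    # as in standard CLIP-style pre-training.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = logit_scale * image_emb @ text_emb.t()  # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


# Example usage with stand-in tensors for the vision and text encoders.
pool = PromptGuidedPooling(dim=512)
patch_tokens = torch.randn(4, 196, 512)  # stand-in for ViT patch features
text_emb = torch.randn(4, 512)           # stand-in for text encoder output
loss = contrastive_loss(pool(patch_tokens), text_emb)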