Yi Zhu presents AIM: Adapting Image Models for Efficient Video Understanding
On 2023-02-23 at 16:00, online. Zoom link: https://feectu.zoom.us/j/96078936185
Recent vision-transformer-based video models mostly follow the "image
pre-training then finetuning" paradigm and have achieved great success on
multiple video benchmarks. However, fully finetuning such a video model can be
computationally expensive and unnecessary, given that pre-trained image
transformer models have demonstrated exceptional transferability. In this work,
we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient
video understanding. By freezing the pre-trained image model and adding a few
lightweight Adapters, we introduce spatial adaptation, temporal adaptation and
joint adaptation to gradually equip an image model with spatiotemporal reasoning
capability. We show that our proposed AIM achieves performance competitive
with, or even better than, prior art on four video action recognition
benchmarks, with substantially fewer tunable parameters. Thanks to its
simplicity, our method is also generally applicable to different pre-trained
image models, and thus has the potential to leverage more powerful image
foundation models in the future.
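
The abstract describes the core mechanism: freeze the pre-trained image
backbone and tune only small Adapters inserted for spatial, temporal, and
joint adaptation. Below is a minimal sketch of that idea, assuming PyTorch and
a standard bottleneck-adapter design (down-projection, nonlinearity,
up-projection, residual connection); the class name, bottleneck width, and
zero-initialization are illustrative assumptions, not the authors'
implementation.

# Illustrative sketch only; names and dimensions are assumptions, not AIM's code.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter added residually, so the frozen
    backbone's features pass through unchanged at initialization
    (the up-projection starts at zero)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # down-project to a small width
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)    # project back to model width
        nn.init.zeros_(self.up.weight)          # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

def freeze_backbone(model: nn.Module) -> None:
    """Freeze everything except adapter parameters, so only the few
    lightweight adapters are tuned during video finetuning."""
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name

With the backbone frozen, gradients flow only through the adapter parameters,
which is what keeps the tunable-parameter count small compared with fully
finetuning the video model.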
External www: https://openreview.net/forum?id=CIoSZ_HKHS7