Detail of the student project

Topic:Learning Local Features with Weak Supervision
Department:Department of Cybernetics
Supervisor:Torsten Sattler, Dr. rer. nat.
Announce as:Master's thesis, Semester project
Description:Motivation. Local features such as SIFT are a vital component of many 3D computer vision algorithms, including Structure-from-Motion, SLAM, and visual localization. These algorithms rely on local features for data association, e.g., to establish correspondences between images or between images and 3D points. They fail if the local features cannot provide sufficiently many matches. To obtain features that are more robust to changes in viewing conditions, e.g., changes in viewpoint or in illumination such as day-night changes, deep learning is increasingly used to learn local features [2,3]. Most approaches for learning local features rely on pixel-level correspondences between images. However, obtaining such correspondences is either hard or requires the use of synthetic data (potentially leading to generalization problems). A recently proposed alternative is weak supervision that does not provide pixel-level annotations, e.g., using only information about whether two images depict the same part of the scene [4] or about their relative poses [1]. The promise of such approaches is that weak supervision signals are much easier to obtain, which makes it easier to generate large training datasets and thus increases the chance of learning better features.

Scientific Objective. Obtaining weak supervision signals at scale, e.g., in the form of relative camera poses [1], implies automating the process. Concretely, for the example of relative poses, this means automatically running Structure-from-Motion or SLAM on a large amount of data. Structure-from-Motion and SLAM, however, themselves depend on some form of local feature. At the same time, Structure-from-Motion reconstructions also provide pixel-level correspondences between images (although of a sparse nature) that can be used to train supervised methods. A natural question is thus whether, and to what degree, weak supervision allows us to learn better features.
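
To illustrate why a relative pose is a usable, if weak, training signal: a known pose does not say where the match of a pixel lies, only that it must fall on the corresponding epipolar line. The following is a minimal NumPy sketch under stated assumptions (pinhole cameras with known intrinsics, the convention X2 = R X1 + t, and hypothetical function names). It is not the loss formulation of [1], only the geometric constraint that pose-based supervision can build on.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def fundamental_from_pose(K1, K2, R, t):
    """Fundamental matrix for a relative pose (assumed convention: X2 = R @ X1 + t)."""
    E = skew(t) @ R                                   # essential matrix
    return np.linalg.inv(K2).T @ E @ np.linalg.inv(K1)

def epipolar_distance(F, p1, p2):
    """Pixel distance of each point in p2 to the epipolar line of its partner in p1.

    p1, p2: arrays of shape (N, 2) holding pixel coordinates in image 1 and image 2.
    """
    p1 = np.asarray(p1, dtype=float)
    p2 = np.asarray(p2, dtype=float)
    h1 = np.hstack([p1, np.ones((len(p1), 1))])       # homogeneous coordinates
    h2 = np.hstack([p2, np.ones((len(p2), 1))])
    lines = h1 @ F.T                                  # row i is the line F @ h1[i] in image 2
    numerator = np.abs(np.sum(lines * h2, axis=1))
    denominator = np.linalg.norm(lines[:, :2], axis=1)
    return numerator / denominator
```

A candidate match with a small epipolar distance is consistent with the pose but not necessarily correct, which is exactly what makes this form of supervision weaker than pixel-level correspondences.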

Project description. The overall goal of the project is to understand whether weakly supervised learning of local features offers benefits over strongly supervised learning and, if so, how large the improvement is. To this end, we will use the 3D models obtained via Structure-from-Motion to train local feature descriptors in both a strongly supervised and a weakly supervised way. For the former, we will follow common practice: we will extract local patches around the local features that belong to 3D points in the Structure-from-Motion model and use them to train a descriptor [5]. For the latter, we will use the camera pose-based approach from [1]. After training both descriptors on the datasets used in [1], we will compare the strongly supervised and weakly supervised descriptors on a variety of tasks [1,5]. In addition, we will evaluate the impact of the training dataset size on both descriptors.
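
For the strongly supervised branch, patches that observe the same 3D point form matching pairs, and [5] trains on them with a triplet margin loss that picks the hardest non-matching descriptor within each batch. The following is a simplified NumPy sketch of that loss value only (no gradients or network); the function name, array layout, and default margin are illustrative assumptions rather than the reference implementation.

```python
import numpy as np

def hardest_in_batch_loss(anchors, positives, margin=1.0):
    """Triplet margin loss with hardest in-batch negatives (simplified sketch).

    anchors, positives: arrays of shape (N, D) with N >= 2, where row i of each
    array describes two patches of the same 3D point.
    """
    anchors = np.asarray(anchors, dtype=float)
    positives = np.asarray(positives, dtype=float)

    # dist[i, j] = Euclidean distance between anchor i and positive j.
    diff = anchors[:, None, :] - positives[None, :, :]
    dist = np.linalg.norm(diff, axis=2)

    # Matching pairs sit on the diagonal.
    pos_dist = np.diag(dist)

    # Exclude the matching pairs, then take the closest non-matching descriptor
    # for each anchor (row minimum) and for each positive (column minimum).
    masked = np.where(np.eye(len(dist), dtype=bool), np.inf, dist)
    hardest_neg = np.minimum(masked.min(axis=1), masked.min(axis=0))

    # Hinge: the matching distance should be smaller than the hardest
    # non-matching distance by at least the margin.
    return float(np.mean(np.maximum(0.0, margin + pos_dist - hardest_neg)))
```

The loss is zero when every matching pair is closer than its hardest non-match by at least the margin, and it grows as matching descriptors drift apart or non-matching ones collide.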
It was recently shown that each local descriptor can also be used to define a local feature detector [2,6]. In our experiments, we will thus evaluate the learned descriptors both with standard detectors such as [3] and with the detectors defined by the learned descriptors themselves.
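
As an illustration of how a dense descriptor map can double as a detector, the sketch below follows the spirit of the hard detection rule in [2]: a pixel is kept if its strongest channel is also a strict local maximum within that channel. This is a simplified NumPy sketch under assumptions (a feature map of shape (C, H, W), a 3x3 neighborhood, a hypothetical function name); it does not reproduce the soft, trainable score of [2] or the approach of [6].

```python
import numpy as np

def detect_from_dense_features(features):
    """Return (row, col) positions detected from a dense feature map of shape (C, H, W)."""
    features = np.asarray(features, dtype=float)
    C, H, W = features.shape

    # Strongest channel at every pixel.
    best_channel = features.argmax(axis=0)            # shape (H, W)

    # Largest value among the 8 spatial neighbors, computed per channel.
    # Padding with -inf lets border pixels compare only against real neighbors.
    padded = np.pad(features, ((0, 0), (1, 1), (1, 1)), constant_values=-np.inf)
    neighbor_max = np.full_like(features, -np.inf)
    for dy in range(3):
        for dx in range(3):
            if dy == 1 and dx == 1:
                continue                              # skip the pixel itself
            neighbor_max = np.maximum(neighbor_max, padded[:, dy:dy + H, dx:dx + W])

    # Strict spatial local maximum within each channel.
    is_peak = features > neighbor_max                 # shape (C, H, W)

    # Keep a pixel only if it is a peak in its own strongest channel.
    rows, cols = np.indices((H, W))
    detected = is_peak[best_channel, rows, cols]
    return np.argwhere(detected)
```

Because the rule only reads the descriptor map, the same function can in principle be applied to the output of either the strongly or the weakly supervised network.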
This project can lead to a publication in a major Computer Vision conference such as CVPR, ECCV, or ICCV.
Bibliography:[1] Qianqian Wang, Xiaowei Zhou, Bharath Hariharan, Noah Snavely, Learning Feature Descriptors using Camera Pose Supervision, ECCV 2020.
[2] Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, Torsten Sattler, D2-Net: A Trainable CNN for Joint Detection and Description of Local Features, CVPR 2019.
[3] Daniel DeTone, Tomasz Malisiewicz, Andrew Rabinovich, SuperPoint: Self-Supervised Interest Point Detection and Description, CVPR Workshops 2018.
[4] Ignacio Rocco, Relja Arandjelović, Josef Sivic, Efficient Neighbourhood Consensus Networks via Submanifold Sparse Convolutions, ECCV 2020.
[5] Anastasiya Mishchuk, Dmytro Mishkin, Filip Radenovic, Jiri Matas, Working hard to know your neighbor's margins: Local descriptor learning loss, NeurIPS 2017.
[6] Yurun Tian, Vassileios Balntas, Tony Ng, Axel Barroso-Laguna, Yiannis Demiris, Krystian Mikolajczyk, D2D: Keypoint Extraction with Describe to Detect Approach, ACCV 2020.
Responsible person: Petr Pošík