Valentýn Číhala presents DUSt3R: Geometric 3D Vision Made Easy
On 2024-06-13 at 11:00, via https://feectu.zoom.us/j/98555944426
Multi-view stereo reconstruction (MVS) in the wild requires first estimating the
camera parameters, i.e. the intrinsic and extrinsic parameters. These are usually
tedious and cumbersome to obtain, yet they are mandatory for triangulating
corresponding pixels in 3D space, which is at the core of all best-performing MVS
algorithms. In this work, we take an opposite stance and introduce DUSt3R, a
radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of
arbitrary image collections, i.e. operating without prior information about
camera calibration or viewpoint poses. We cast the pairwise reconstruction
problem as a regression of pointmaps, relaxing the hard constraints of usual
projective camera models. We show that this formulation smoothly unifies the
monocular and binocular reconstruction cases. When more than two images are
provided, we further propose a simple yet effective global alignment strategy
that expresses all pairwise pointmaps in a common reference frame. We base our
network architecture on standard Transformer encoders and decoders, allowing us
to leverage powerful pretrained models. Our formulation directly provides a 3D
model of the scene as well as depth information; interestingly, we can also
seamlessly recover pixel matches and relative and absolute camera poses from it.
Exhaustive experiments on all these tasks show that DUSt3R unifies various 3D
vision tasks and sets new states of the art in monocular and multi-view depth
estimation as well as relative pose estimation. In summary, DUSt3R makes many
geometric 3D vision tasks easy.
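
Some context for the talk. A pointmap is a dense H x W x 3 map that assigns a 3D
scene point to every pixel of an image; the pairwise network regresses two such
pointmaps, both expressed in the first camera's frame, together with per-pixel
confidences. Roughly, the paper trains this with a confidence-weighted regression
loss of the following form (our notation; z and z-bar are scene-scale normalizers
and alpha balances the confidence regularizer):

    \mathcal{L}_{\mathrm{conf}}
      = \sum_{v \in \{1,2\}} \sum_{i} C_i^{v}\,\ell_{\mathrm{regr}}(v,i)
        - \alpha \log C_i^{v},
    \qquad
    \ell_{\mathrm{regr}}(v,i)
      = \Bigl\lVert \tfrac{1}{z} X_i^{v} - \tfrac{1}{\bar z} \bar X_i^{v} \Bigr\rVert,

where X is the predicted pointmap, X-bar the ground truth, and C the predicted
confidence: confident pixels are penalized more for regression error, while the
-log C term discourages the network from declaring everything uncertain.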
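When more than two views are available, the global alignment expresses all
pairwise predictions in one world frame. A sketch of the kind of objective
involved (our notation, details simplified): for a connectivity graph with edges
e in E over image pairs, jointly optimize world pointmaps chi^v, a rigid pose P_e
and a scale sigma_e > 0 per edge,

    \chi^{*} = \arg\min_{\chi,\,P,\,\sigma}
      \sum_{e \in \mathcal{E}} \sum_{v \in e} \sum_{i}
      C_i^{v,e}\,\bigl\lVert \chi_i^{v} - \sigma_e P_e X_i^{v,e} \bigr\rVert,

where X^{v,e} is the pointmap of view v predicted for pair e and C^{v,e} its
confidence. The scales must be kept away from zero (e.g. by constraining their
product to 1) to exclude the trivial all-zero solution; the paper optimizes this
directly by gradient descent in 3D rather than via reprojection-based bundle
adjustment.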
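Since each predicted pointmap lives in a camera-centric frame, intrinsics follow
from the pinhole relation u = f * x / z. Below is a minimal, hypothetical Python
sketch of recovering a focal length from a single (H, W, 3) pointmap via a
Weiszfeld-style robust fit; it assumes a centered principal point and square
pixels, and the function name and all details are ours, not the released DUSt3R
API:

    import numpy as np

    def estimate_focal(pointmap, n_iter=10, eps=1e-8):
        """Robustly fit one focal length to a camera-frame (H, W, 3) pointmap.

        Assumes a centered principal point and square pixels (hypothetical setup);
        per-pixel depth is simply pointmap[..., 2].
        """
        H, W, _ = pointmap.shape
        x = pointmap[..., 0].ravel()
        y = pointmap[..., 1].ravel()
        z = pointmap[..., 2].ravel()

        # Pixel coordinates relative to the assumed image-center principal point.
        vv, uu = np.mgrid[0:H, 0:W]
        u = uu.ravel() - W / 2.0
        v = vv.ravel() - H / 2.0

        valid = z > eps                                    # points in front of the camera
        a, b = x[valid] / z[valid], y[valid] / z[valid]    # projection at focal f = 1
        u, v = u[valid], v[valid]

        # Weiszfeld-style IRLS for  f* = argmin_f  sum_i ||(u_i, v_i) - f (a_i, b_i)||
        f = np.sqrt((u**2 + v**2).mean() / (a**2 + b**2).mean())  # scale-matched init
        for _ in range(n_iter):
            r = np.hypot(u - f * a, v - f * b) + eps       # per-pixel residual norm
            w = 1.0 / r                                    # reweighting turns L2 into L1
            f = (w * (u * a + v * b)).sum() / (w * (a**2 + b**2)).sum()
        return f

With matched 3D points in hand, a relative or absolute camera pose can then be
obtained by, e.g., Procrustes alignment or PnP, which is how pairwise pointmaps
also yield the cameras mentioned in the abstract.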
External www: https://cmp.felk.cvut.cz/~toliageo/rg/