We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. Our method estimates a 6DoF pose by inverting the 3DGS rendering process.
Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each ellipsoid that parameterizes the 3DGS model. Each Ellicell ray is associated with the rendering parameters of its ellipsoid, which are in turn used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best-scoring bundle of rays, whose intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the need for an a priori pose initialization and solves 6DoF pose estimation in closed form, without iterations.
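The closed-form step above recovers the camera center as the point nearest to the selected bundle of rays. A minimal sketch of such a least-squares ray intersection follows; the function name, weighting scheme, and use of NumPy are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def camera_center_from_rays(origins, dirs, weights=None):
    """Closed-form least-squares point nearest to a bundle of 3D rays.

    Each ray is o_i + t * d_i with unit direction d_i. The point c that
    minimizes the sum of weighted squared distances to all rays solves the
    3x3 system: sum_i w_i (I - d_i d_i^T) c = sum_i w_i (I - d_i d_i^T) o_i.
    """
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)
    if weights is None:
        weights = np.ones(len(origins))
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d, w in zip(origins, dirs, weights):
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray direction
        A += w * P
        b += w * P @ o
    return np.linalg.solve(A, b)
```

If all selected rays pass exactly through one point and their directions are not all parallel, this returns that point; with noisy bindings it returns the weighted least-squares compromise.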
Moreover, compared to existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS improves the overall average rotation accuracy by 12% and translation accuracy by 22% on real scenes, despite not requiring any initialization pose. At the same time, our method operates in near real-time, reaching 15 fps on consumer hardware.
6DGS estimates the camera pose given a target image and a pre-computed 3DGS model. The image is encoded using a visual backbone (a). Concurrently, rays are uniformly projected from the centers of the 3DGS ellipsoids (b), and their corresponding colors are estimated. An attention map mechanism then compares the encoded ray and image features (c). From this comparison, the N_top matches are selected based on their attention scores, and the camera location is estimated (d) as the solution of a weighted least-squares problem, yielding a distinct 6DoF pose for the image.
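Step (c) can be sketched as a dot-product attention map between pixel and ray features, from which the highest-scoring pairs and their weights are kept for the subsequent least-squares solve. This is a simplified illustration under assumed shapes and scoring; the function and normalization are hypothetical, not the paper's exact architecture:

```python
import numpy as np

def topk_pixel_ray_matches(img_feats, ray_feats, n_top):
    """Score every pixel-ray pair by dot-product attention and keep the
    n_top highest-scoring pairs as (pixel_idx, ray_idx, weight) arrays."""
    scores = img_feats @ ray_feats.T                 # (num_pixels, num_rays) attention map
    flat = scores.ravel()
    top = np.argpartition(flat, -n_top)[-n_top:]     # indices of the n_top scores
    pix, ray = np.unravel_index(top, scores.shape)
    w = flat[top]
    w = np.exp(w - w.max())                          # softmax over the kept scores
    w /= w.sum()                                     # weights for the weighted LS solve
    return pix, ray, w
```

The returned weights can serve directly as the per-ray weights of the weighted least-squares camera-center estimate in step (d).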
Qualitative results on the Tanks & Temples and Mip-NeRF 360° datasets. Each figure depicts the target image (top left) and its corresponding NVS (bottom left), rendered with the given 3DGS model from the camera pose estimated by 6DGS (shown to the right of each scene). The 3DGS model of each scene is displayed as sub-sampled points derived from the corresponding ellipsoid centroids. The camera color indicates the tested configuration, as shown below.
@inproceedings{bortolon20246dgs,
title = {6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model},
author = {Bortolon, Matteo and Tsesmelis, Theodore and James, Stuart and Poiesi, Fabio and {Del Bue}, Alessio},
booktitle = {ECCV},
year = {2024}
}