6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model


Our 6DGS method departs from conventional analysis-by-synthesis methodologies. Standard NeRF-based methods (left) employ an iterative process: they render candidate poses and compare them with the target image before updating the pose. In contrast, 6DGS (right) estimates the camera pose by selecting a bundle of rays cast from the ellipsoid surfaces (a radiant Ellicell) and learning an attention map (based on DINOv2) that outputs ray/image-pixel correspondences. The optimal bundle of rays should intersect the optical center of the camera; these rays are then used to estimate the camera rotation in closed form. Our 6DGS method offers significantly improved accuracy and speed, recovering the pose in a single, one-shot estimate.

Abstract

We propose 6DGS to estimate the camera pose of a target RGB image given a 3D Gaussian Splatting (3DGS) model representing the scene. Our method estimates a 6DoF pose by inverting the 3DGS rendering process.

Starting from the object surface, we define a radiant Ellicell that uniformly generates rays departing from each of the ellipsoids that parameterize the 3DGS model. Each Ellicell ray is associated with the rendering parameters of its ellipsoid, which are in turn used to obtain the best bindings between the target image pixels and the cast rays. These pixel-ray bindings are then ranked to select the best-scoring bundle of rays, whose intersection provides the camera center and, in turn, the camera rotation. The proposed solution obviates the need for an a priori pose for initialization, and it solves 6DoF pose estimation in closed form, without iterations.
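
To make the closed-form step concrete, below is a minimal NumPy sketch of recovering the camera center as the weighted least-squares intersection of a ray bundle. The function name, array shapes, and the use of binding scores as weights are illustrative assumptions, not the authors' released implementation.

import numpy as np

def camera_center_from_rays(origins, directions, weights=None):
    # origins: (N, 3) ray start points on the ellipsoid surfaces.
    # directions: (N, 3) ray directions (normalized below).
    # weights: (N,) optional per-ray scores, e.g. ranked binding scores.
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    w = np.ones(len(d)) if weights is None else np.asarray(weights, float)
    # Each ray contributes P_i = I - d_i d_i^T, which penalizes the
    # component of (p - o_i) perpendicular to ray i.
    P = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]      # (N, 3, 3)
    A = (w[:, None, None] * P).sum(axis=0)                         # (3, 3)
    b = (w[:, None, None] * (P @ origins[:, :, None])).sum(axis=0) # (3, 1)
    return np.linalg.solve(A, b).ravel()                           # camera center

# Two rays that meet at (1, 2, 3):
o = np.array([[0.0, 2.0, 3.0], [1.0, 0.0, 3.0]])
d = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(camera_center_from_rays(o, d))  # -> approx [1. 2. 3.]

With only two rays the system is exactly determined; with a ranked bundle, the weights let poorly matched rays contribute less to the estimated center.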

Moreover, compared to existing Novel View Synthesis (NVS) baselines for pose estimation, 6DGS improves the overall average rotational accuracy by 12% and translation accuracy by 22% on real scenes, despite not requiring any initialization pose. At the same time, our method operates in near real time, reaching 15 fps on consumer hardware.

Video

Method

6DGS estimates the camera pose given a target image and a pre-computed 3DGS model. The image is encoded using a visual backbone (a). Concurrently, rays are uniformly projected from the centers of the 3DGS ellipsoids (b), and their corresponding colors are estimated. An attention-map mechanism is then employed to compare the encoded ray and image features (c). Following this comparison, the N_top matches are selected from the attention map, and the camera location is estimated (d) as the solution of a weighted least-squares problem, yielding a single 6DoF pose for the image.
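
As a sketch of steps (c) and (d), the snippet below scores every cast ray against every image pixel and keeps the N_top bindings. The cosine-similarity attention and the per-ray argmax reduction are assumptions standing in for the paper's learned attention map.

import numpy as np

def select_top_bindings(ray_feats, pix_feats, n_top):
    # ray_feats: (R, D) encoded ray features; pix_feats: (P, D) encoded
    # image features (e.g. from a DINOv2-style backbone).
    r = ray_feats / np.linalg.norm(ray_feats, axis=1, keepdims=True)
    p = pix_feats / np.linalg.norm(pix_feats, axis=1, keepdims=True)
    attn = r @ p.T                          # (R, P) attention map
    best_pix = attn.argmax(axis=1)          # best-matching pixel per ray
    score = attn[np.arange(len(r)), best_pix]
    keep = np.argsort(score)[::-1][:n_top]  # indices of top-scoring rays
    return keep, best_pix[keep], score[keep]

The returned scores can then serve as the weights of the least-squares solve sketched earlier, so the estimated camera center reflects the confidence of each ray/pixel binding.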

Results

Qualitative results on the Tanks & Temples and Mip-NeRF 360° datasets. Each figure depicts the target image (top left) and its corresponding NVS (bottom left), rendered from the given 3DGS model at the camera pose estimated by 6DGS (shown on the right of each scene). The 3DGS model for each scene is displayed as sub-sampled points derived from the corresponding ellipsoid centroids. The camera color indicates the tested configuration, as listed below.

Camera configurations

Ground Truth
6DGS
iNeRF w/ pose prior
iNeRF w/o pose prior
Parallel NeRF w/ pose prior
Parallel NeRF w/o pose prior
NeMo + VoGE w/ pose prior
NeMo + VoGE w/o pose prior

Truck

Tanks & Temples

Target image
Estimated NVS

Counter

Mip-NeRF 360°

Target image
Estimated NVS

BibTeX

@inproceedings{bortolon20246dgs,
  title = {6DGS: 6D Pose Estimation from a Single Image and a 3D Gaussian Splatting Model},
  author = {Bortolon, Matteo and Tsesmelis, Theodore and James, Stuart and Poiesi, Fabio and {Del Bue}, Alessio},
  booktitle = {ECCV},
  year = {2024}
}