IFFNeRF: Initialisation Free and Fast 6DoF pose estimation from a single image and a NeRF model

From (i) an image with an unknown pose and (ii) a NeRF model, we recover the pose by first (iii) sampling surface points using the Metropolis-Hastings algorithm and (iv) casting rays from them in an isocell distribution. We then (v) correlate the rays with the image using attention to identify the relevant rays and (vi) recover the unknown camera pose.

Abstract

We introduce IFFNeRF to estimate the camera pose of a given image, building on the neural radiance fields (NeRF) formulation. IFFNeRF is specifically designed to operate in real time and eliminates the need for an initial pose guess close to the sought solution.

IFFNeRF uses the Metropolis-Hastings algorithm to sample surface points from within the NeRF model. From these sampled points, we cast rays and deduce the color of each ray through pixel-level view synthesis. The camera pose can then be estimated as the solution to a Least Squares problem by selecting correspondences between the query image and the resulting ray bundle. We facilitate this selection through a learned attention mechanism that relates the query image embedding to the embeddings of the parameterized rays, thereby matching the rays pertinent to the image.

In synthetic and real evaluation settings, we show that our method improves angular and translation accuracy by 80.1% and 67.3%, respectively, compared to iNeRF, while requiring no initial pose guess and running at 34 fps on consumer hardware.

Method

IFFNeRF aims to predict the camera pose given an observed image and a pre-computed NeRF model. We first apply the Metropolis-Hastings algorithm to sample surface points within the scene volume, then cast a set of rays from an isocell at each surface point. We then learn an attention map between the embeddings of the image and of the generated rays. Based on the information contained in the attention map, we select a subset of candidate rays that are likely to fall within the image. Finally, to recover the camera pose at test time, we optimize using Least Squares over the selected rays.
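The two non-learned steps above, sampling and the final fit, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation and rests on several assumptions: `shell_density` is a hypothetical toy stand-in for querying NeRF density, the sampler uses a plain Gaussian random-walk proposal, and the Least Squares step is posed as finding the point closest to all selected rays, which recovers only the camera position and not its orientation. The isocell ray casting and the attention-based ray selection are skipped; the rays here point exactly at a known camera centre so the result can be checked.

```python
import numpy as np


def shell_density(x, radius=1.0, sigma=0.05):
    """Hypothetical stand-in for a NeRF density query: high near a sphere's surface."""
    r = np.linalg.norm(x)
    return np.exp(-0.5 * ((r - radius) / sigma) ** 2)


def metropolis_hastings(density, x0, n_samples, step=0.1, burn_in=500, seed=0):
    """Draw 3D points with probability proportional to density(x).

    Uses a symmetric Gaussian random-walk proposal, so the acceptance
    probability reduces to min(1, density(proposal) / density(current)).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    p = density(x)
    samples = []
    for i in range(burn_in + n_samples):
        proposal = x + step * rng.standard_normal(3)
        p_new = density(proposal)
        # Accept uphill moves always, downhill moves with probability p_new / p.
        if p_new >= p or rng.random() * p < p_new:
            x, p = proposal, p_new
        if i >= burn_in:
            samples.append(x.copy())
    return np.array(samples)


def nearest_point_to_rays(origins, directions, weights=None):
    """Least Squares point minimising the weighted squared distance to all rays.

    Each ray contributes the projector P = I - d d^T onto the plane
    orthogonal to its unit direction d; the normal equations are
    (sum_i w_i P_i) c = sum_i w_i P_i o_i.
    """
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    P = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]  # shape (N, 3, 3)
    if weights is None:
        weights = np.ones(len(d))  # e.g. attention scores could be used here (assumption)
    A = np.einsum("n,nij->ij", weights, P)
    b = np.einsum("n,nij,nj->i", weights, P, origins)
    return np.linalg.solve(A, b)


# Toy check: sample "surface" points, aim one ray per point at a known
# camera centre, then recover that centre from the rays alone.
points = metropolis_hastings(shell_density, x0=[1.0, 0.0, 0.0], n_samples=2000)
camera_true = np.array([0.5, -2.0, 1.0])
rays = camera_true - points
camera_est = nearest_point_to_rays(points, rays)
print("position error:", np.linalg.norm(camera_est - camera_true))
```

In a real pipeline the selected rays would be noisy and would include outliers, so weighting each ray (for instance by its attention score) or wrapping the solve in a robust scheme would likely matter more than it does in this noise-free example.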

Results

Qualitative results on two datasets, NeRF synthetic and Tanks & Temples. Each scene is shown using the points sampled with Metropolis-Hastings, and the color of each camera indicates the tested configuration, as listed below.

Camera configurations

Ground Truth
IFFNeRF
iNeRF with pose prior
iNeRF without pose prior

Barn (Tanks & Temples)


Ignatius (Tanks & Temples)


Ficus (NeRF-synthetic 360°)


Ship (NeRF-synthetic 360°)

BibTeX

@inproceedings{bortolon2024iffnerf,
  title = {IFFNeRF: Initialization Free and Fast 6DoF pose estimation from a single image and a NeRF model},
  author = {Bortolon, Matteo and Tsesmelis, Theodore and James, Stuart and Poiesi, Fabio and {Del Bue}, Alessio},
  booktitle = {ICRA},
  year = {2024}
}