Our GRASPLAT method combines the precision of 3D grasping predictors with the accessibility of RGB-based approaches. Conventional 3D grasping models (left) require an object scan (e.g., a point cloud) for precise hand pose estimation, which makes them difficult to deploy in real-world scenarios. RGB-based grasping models (right) eliminate the need for explicit 3D scans but suffer from reduced accuracy. GRASPLAT (center) bridges this gap by leveraging novel view synthesis to supervise a grasp predictor that operates directly on RGB images. This enables high-precision grasping without requiring object scans, significantly improving efficiency and practicality.

Abstract

Dexterous robotic grasping with multi-fingered hands remains a significant challenge in robotics. We propose GRASPLAT, a novel approach that combines the usability of RGB methods with the accuracy of 3D techniques through advanced Novel View Synthesis (NVS).

Our key insight is that by synthesizing physically plausible images of a hand grasping an object, we can accurately regress corresponding hand joint configurations for successful grasps without requiring complete 3D scans during inference.

GRASPLAT leverages 3D Gaussian Splatting to generate high-fidelity novel views of real hand-object interactions, enabling end-to-end training with RGB data. Our approach compares rendered novel views with actual grasp appearances using photometric loss, minimizing discrepancies between rendered and real images.
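As a rough illustration of that photometric comparison, the sketch below computes an L1 discrepancy between a rendered grasp image and the real capture; the optional mask and the plain L1 form are assumptions made for clarity, not the paper's exact objective.

import torch

def photometric_loss(rendered, real, mask=None):
    """Mean absolute (L1) difference between a rendered grasp image and the
    real image; `mask` optionally restricts the loss to hand/object pixels.
    Illustrative helper only; the actual training loss may add further terms."""
    diff = (rendered - real).abs()
    if mask is not None:
        diff = diff * mask
        return diff.sum() / mask.sum().clamp(min=1)
    return diff.mean()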

Extensive experiments on both synthetic and real-world datasets demonstrate that GRASPLAT improves grasp success rates by up to 36.9% compared to existing image-based methods.

Video

Method

GRASPLAT predicts 3D hand poses from a single RGB image using a frozen, denoised DINOv2 backbone. The extracted feature map is processed by a spatial transformer that estimates the hand parameters, which are then refined by an MLP. During training, a feedback mechanism based on 3D Gaussian Splatting (3DGS) refines the predictions by comparing rendered and real images. To accelerate training, the 3DGS models are precomputed and cached. The rendered scene includes both the object and the articulated hand, posed according to the predicted parameters.
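To make the prediction path concrete, the sketch below follows the same backbone → spatial transformer → MLP structure in PyTorch; the convolutional stand-in for the frozen DINOv2 encoder, the layer sizes, and the 51-dimensional hand parameterization are illustrative assumptions rather than the paper's actual architecture.

import torch
import torch.nn as nn

class GraspRegressor(nn.Module):
    """RGB image -> hand parameters, mirroring the pipeline described above.
    The backbone is a placeholder for the frozen (denoised) DINOv2 encoder;
    all dimensions are illustrative, not the values used in the paper."""
    def __init__(self, feat_dim=384, n_hand_params=51):
        super().__init__()
        # Stand-in for the frozen DINOv2 patch-feature extractor.
        self.backbone = nn.Conv2d(3, feat_dim, kernel_size=14, stride=14)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Spatial transformer stand-in: self-attention over patch tokens.
        self.spatial = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # MLP head refining the pooled features into hand parameters
        # (e.g., articulation angles plus a global wrist transform).
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, n_hand_params),
        )

    def forward(self, rgb):                       # rgb: (B, 3, H, W)
        feat = self.backbone(rgb)                 # (B, C, h, w) patch features
        tokens = feat.flatten(2).transpose(1, 2)  # (B, h*w, C) token sequence
        tokens = self.spatial(tokens)             # spatial reasoning
        pooled = tokens.mean(dim=1)               # global descriptor
        return self.head(pooled)                  # predicted hand parameters

model = GraspRegressor()
hand_params = model(torch.randn(1, 3, 224, 224))  # -> shape (1, 51)

During training, these predicted parameters would pose the articulated hand in the precomputed 3DGS scene, and the rendered image would be compared against the real capture with a photometric loss like the one sketched earlier.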

GRASPLAT dataset sample

Sample video from the GRASPLAT dataset showing various hand-object interactions. The video is randomly selected from a subset of 30 videos featuring different objects; refresh the page to see a different one.

Acknowledgments

This project received support from the EU H2020 project RePAIR (Grant Agreement No. 964854), the Center for Responsible AI, and the MPR-2023-12-SACCCT Project 14935 AI.PackBot.
This work was carried out during an internship of the first author at IST, Lisbon.

BibTeX

@misc{bortolon2025grasplat,
  title={GRASPLAT: Enabling dexterous grasping through novel view synthesis},
  author={Bortolon, Matteo and Duarte, Nuno Ferreira and Moreno, Plinio and Poiesi, Fabio and Santos-Victor, José and Del Bue, Alessio},
  journal={IROS},
  year={2025},
}