Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation

Accepted at CVPR 2026 (Highlight)

Introduction

Estimating the 6-DoF pose of an object from a single RGB image is a central problem in computer vision, robotics, augmented reality, and autonomous systems. While many methods rely on intermediate correspondences, keypoints, or refinement stages, Cov2Pose follows a direct pose regression paradigm.

The key idea behind Cov2Pose is to enrich the pose regression pipeline with spatial second-order statistics. Instead of relying solely on first-order or globally pooled features, Cov2Pose uses a covariance-based representation that captures spatial relationships between regions of the input image more effectively, as shown in the figure above. Cov2Pose further introduces a manifold-aware formulation of pose regression by encoding pose information in symmetric positive definite (SPD) representations, which lets the network learn pose predictions in a geometry-aware representation space. The main contributions are as follows:

The first covariance-based deep learning framework for RGB-based end-to-end 6-DoF object pose regression, which leverages the spatial covariance of backbone features to encode higher-order statistics.
A manifold-aware training pipeline that applies geometry-preserving dimensionality reduction using Bilinear Mapping layers (BiMap) [1] to learn a compact SPD representation.
A fully differentiable, CAD-free, one-to-one, and continuous pose regressor that maps the latent SPD matrix to a continuous 6D rotation representation and translation via a differentiable Cholesky decomposition.
Extensive experiments on three object pose benchmarks, namely LineMOD [2], Occ-LineMOD [3], and YCB-Video [4], showing that Cov2Pose achieves state-of-the-art results among direct regression methods.

Method

An overview of the proposed pose estimation framework is illustrated in the figure above:

The input image containing the object of interest is fed into a CNN backbone to extract feature maps. Spatial covariance pooling is applied over the \(H \times W\) grid. The resulting covariance matrix is then passed to the pose head, which regresses the object's pose.
Rotation vectors and translation are decoded from the compact \(4 \times 4\) SPD matrix using a differentiable Cholesky layer. The rotation is then mapped to \(\mathrm{SO}(3)\) using Gram-Schmidt orthogonalization.
Each layer of the pose regression head applies a BiMap operation followed by an eigenvalue rectification step [1], preserving the strict positive definiteness of the representation and enabling effective manifold-aware optimization.
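
The three steps above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It assumes that the covariance is taken between the \(H \times W\) grid cells with the channels as observations, that the BiMap weights are fixed semi-orthogonal matrices rather than learned ones, and that the ten entries of the \(4 \times 4\) Cholesky factor are split into rotation and translation by a placeholder layout (six strictly lower entries for the 6D rotation, the logarithms of the first three diagonal entries for the translation). The paper defines the actual layout. All function names are hypothetical.

```python
import numpy as np


def spatial_covariance(features, eps=1e-5):
    """Covariance between the H*W grid cells of a (C, H, W) feature map.

    Assumption: each grid cell is a variable and the C channels are its
    observations, which yields an (H*W, H*W) matrix.
    """
    c, h, w = features.shape
    x = features.reshape(c, h * w)           # rows: channels, columns: grid cells
    x = x - x.mean(axis=0, keepdims=True)    # center every grid cell over channels
    cov = x.T @ x / (c - 1)
    return cov + eps * np.eye(h * w)         # small ridge keeps it strictly SPD


def bimap(spd, weight):
    """BiMap [1]: W X W^T with a full-row-rank W of shape (d_out, d_in)."""
    return weight @ spd @ weight.T


def reeig(spd, eps=1e-4):
    """Eigenvalue rectification [1]: raise eigenvalues below eps to eps."""
    vals, vecs = np.linalg.eigh(spd)
    return (vecs * np.maximum(vals, eps)) @ vecs.T


def rotation_from_6d(v):
    """Gram-Schmidt: map two 3-vectors (stacked in v) to a matrix in SO(3)."""
    a1, a2 = v[:3], v[3:]
    b1 = a1 / np.linalg.norm(a1)
    a2 = a2 - np.dot(b1, a2) * b1            # remove the component along b1
    b2 = a2 / np.linalg.norm(a2)
    b3 = np.cross(b1, b2)                    # right-handed third axis
    return np.stack([b1, b2, b3], axis=1)    # b1, b2, b3 as columns


def decode_pose(spd4):
    """Decode (R, t) from a 4x4 SPD matrix through its Cholesky factor.

    HYPOTHETICAL layout, chosen only for illustration: the six strictly
    lower entries form the 6D rotation vector, and the logs of the first
    three diagonal entries form the translation. The last diagonal entry
    is left unused.
    """
    chol = np.linalg.cholesky(spd4)          # lower-triangular factor
    rot6d = chol[np.tril_indices(4, k=-1)]   # 6 off-diagonal entries
    trans = np.log(np.diag(chol)[:3])        # diagonal is strictly positive
    return rotation_from_6d(rot6d), trans


def pose_from_features(features, weights):
    """Covariance pooling, then BiMap + ReEig per layer, then pose decoding."""
    spd = spatial_covariance(features)
    for weight in weights:
        spd = reeig(bimap(spd, weight))
    return decode_pose(spd)
```

For example, a random feature map of shape (64, 4, 4) gives a \(16 \times 16\) covariance matrix, and two weights of shapes (8, 16) and (4, 8), built for instance from the transposed Q factors of `np.linalg.qr`, reduce it to the \(4 \times 4\) SPD matrix that `decode_pose` expects. In a trained network the weights would be learned and the Cholesky step would be differentiated through, which NumPy alone does not provide.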

Experiments

Cov2Pose is evaluated on standard 6-DoF object pose estimation benchmarks and compared against existing direct and indirect pose estimation approaches. Additional results on Absolute camera Pose Regression (APR) on Cambridge Landmarks and on spacecraft pose estimation on SPEED+ can be found in the supplementary material of the paper.

Quantitative Results

Qualitative Results

Acknowledgements

The present work is supported by the National Research Fund (FNR), Luxembourg, under the C21/IS/15965298/ELITE project, and by Infinite Orbits. Experiments were performed on the Luxembourg national supercomputer MeluXina at LuxProvide.

BibTeX

@InProceedings{Ousalah_2026_CVPR,
    author    = {Ousalah, Nassim Ali and Rostami, Peyman and Gaudilli\`ere, Vincent and Koumandakis, Emmanuel and Kacem, Anis and Ghorbel, Enjie and Aouada, Djamila},
    title     = {Cov2Pose: Leveraging Spatial Covariance for Direct Manifold-aware 6-DoF Object Pose Estimation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {40727-40738}
}

References

[1] Zhiwu Huang and Luc Van Gool. A Riemannian network for SPD matrix learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
[2] Stefan Hinterstoisser, Vincent Lepetit, Slobodan Ilic, Stefan Holzer, Gary Bradski, Kurt Konolige, and Nassir Navab. Model-based training, detection, and pose estimation of texture-less 3D objects in heavily cluttered scenes. In Computer Vision – ACCV 2012, pages 548–562. Springer, Berlin, Heidelberg, 2013.
[3] Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, and Carsten Rother. Learning 6D object pose estimation using 3D object coordinates. In Computer Vision – ECCV 2014, pages 536–551. Springer, Cham, 2014.
[4] Yu Xiang, Tanner Schmidt, Venkatraman Narayanan, and Dieter Fox. PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes. In Robotics: Science and Systems, 2018.