
This article has been accepted for publication in IEEE Robotics and Automation Letters (DOI: 10.1109/LRA.2023.3240362). This is the author's version, which has not been fully edited; content may change prior to final publication.

i2c-net: Using Instance-Level Neural Networks for Monocular Category-Level 6D Pose Estimation

Alberto Remus¹, Salvatore D'Avella¹, Francesco Di Felice¹, Paolo Tripicchio¹ and Carlo Alberto Avizzano¹

Abstract—Object detection and pose estimation are strict requirements for many robotic grasping and manipulation applications to endow robots with the ability to grasp objects with different properties in cluttered scenes and with various lighting conditions. This work proposes the framework i2c-net to extract the 6D pose of multiple objects belonging to different categories, starting from an instance-level pose estimation network and relying only on RGB images. The network is trained on a custom-made synthetic photo-realistic dataset, generated from some base CAD models, opportunely deformed, and enriched with real textures for domain randomization purposes. At inference time, the instance-level network is employed in combination with a 3D mesh reconstruction module, achieving category-level capabilities. Depth information is used for post-processing as a correction. Tests conducted on real objects of the YCB-V and NOCS-REAL datasets outline the high accuracy of the proposed approach.

Fig. 1: On the left, an instance-level network that cannot generalize to unseen instances in the wild. On the right, the same instance-level network enriched with the presented i2c-net framework shows category-level capabilities.
Index Terms—Perception for Grasping and Manipulation; Deep Learning for Visual Perception; RGB-D Perception

I. INTRODUCTION

Nowadays, robots are used in a wide range of applications, including advanced manufacturing [1], human-robot collaboration [2], and logistics [3], that require a high level of autonomy. Two of the primary tasks for robots in such applications are object grasping and manipulation, and robots have to show the ability to adapt to the changing environment while interacting with the surroundings to perform such tasks efficiently, in line with the concept of Industry 4.0. A key factor is the grasping of a variety of objects. To this end, an important aspect for autonomous and reliable grasping of arbitrary objects involves object detection and pose estimation, which are challenging tasks as objects can present different sizes, material properties, and texture appearances, and they can be occluded in cluttered scenes with different lighting conditions. In addition, a desirable factor is that the method should be fast enough, especially for stringent cycle times in industries. Therefore, even though a lot has been done in recent years, pose estimation for autonomous and reliable grasping of different objects remains an open challenge [4].

Thanks to the advantages of deep learning methods [5], classification, detection [6], and segmentation [7] of objects from images received a significant step forward in the past decades. Instead, pose estimation from a single image is not yet a mature field, and there is still space for improvements toward a reliable solution. One problem is that extracting 3D information from a single color image is an ill-posed problem, since the structures of the objects are retrievable only up to scale. The other aspect is the lack of labeled real data, whose collection is a difficult and time-consuming task. Humans can rely on stereo vision or eye motion, and they can also exploit a strong knowledge of the surrounding environment. In that direction, some approaches exist that use multiple points of view to extract the pose of an object or rely on point clouds [8]. However, the computation time is long and increases with the number of viewpoints. An alternative to manually labeled data is the use of synthetic photo-realistic datasets [9] that, in combination with sim-to-real transfer, allow for a high number of training data and can also be applicable to real-world scenarios.

Estimating the 6D pose of an object from an RGB image, taking into account also the scale factor, requires adding 3D information. One of the most diffused approaches is to use 3D Computer Aided Design (CAD) models of the objects, composed of vertices and faces. Doing this, most of the approaches are constrained to objects whose CAD models are used during training, thus hampering generalization, even to objects of the same category [10]. If available, an alternative is to use depth information, adding a higher computational burden. However, simulated depth images are more exposed to the sim-to-real problem than RGB images.

This work presents instance-to-category net, or i2c-net, to extract the 6D pose of multiple objects belonging to certain categories, starting from an instance-level pose estimation network and exploiting a custom-made synthetic dataset for training.

Manuscript received: October 19, 2022; Revised December 14, 2022; Accepted January 18, 2023. This paper was recommended for publication by Editor Markus Vincze upon evaluation of the Associate Editor and Reviewers' comments. This work was supported by Leonardo Company S.p.A. under grant No. LDO/CTI/P/0026995/21, July 2nd, 2021.
¹All the authors are with the Department of Excellence in Robotics & AI, Mechanical Intelligence Institute, Scuola Superiore Sant'Anna, Pisa, Italy. [email protected]
Digital Object Identifier (DOI): see top of this page.


The idea is to extract as much 3D information as possible from RGB images and available CAD models, and to use depth information for post-processing as a correction. Considering some known categories, like, for example, bottles or cans, a few base CAD models, i.e., CAD models for known objects that encompass the diverse shapes of the objects in that category, are used as a starting point for a deformation procedure. Such deformations generate new object models that are enriched with real textures for domain randomization purposes. The instance-level pose estimation network can be trained on such augmented photo-realistic images, and at inference time it is used in combination with a 3D mesh reconstruction network, achieving category-level capabilities. Several tests are conducted on real objects of the YCB-V [11] and NOCS-REAL 275 [12] datasets to assess the performance of the proposed framework. Figure 1 shows a qualitative comparison between the proposed framework and the original GDR-Net, which demonstrated to have promising performance as an instance-level pose estimation network.

The rest of the paper is organized as follows: section II describes the considered problem, related works, and open challenges; section III details the structure of the presented pipeline, from a general viewpoint to a fine-grained description of the neural network models employed; section IV collects experimental results with qualitative and quantitative comparisons with state-of-the-art approaches; and section V is devoted to conclusions by summarizing the contribution of the work.

II. PROBLEM DEFINITION AND RELATED WORK

One of the key aspects of autonomous and reliable grasping is the pose estimation of the target object. Existing approaches that extract the 6D pose information from a single RGB(-Depth) image can be mainly distinguished in instance-level and category-level methods. The first type is biased by the CAD model and texture properties of the object used during training and does not properly work if changes are applied to such an object. The instance-level pose estimation problem can be formalized as follows: given a set of images I and a set of objects O for which the CAD model is available, the objective is to find, for each RGB(-D) image Ij ∈ I and object instance Mi ∈ O, the mapping (Ij, Mi) ↦ (Rij, tij), where Rij ∈ SO(3) is the object rotation matrix and tij ∈ R³ is the 3D translation vector for the particular object instance with respect to the camera frame. The presence of the 3D CAD model resolves the ill-posedness of the problem of extracting 3D information from a single color image, but jeopardizes the generalization capabilities of the method. Indeed, instance-level approaches may have practical applications in scenarios with a fixed number of objects, while their effectiveness drops with unseen instances. In the latest years, this type of network has witnessed an impressive improvement concerning accuracy and speed: from pioneering works in YOLO-6D [13] and Dope [10] to more recent approaches in GDR-Net [14] and SO-Pose [15]. Depth information can increase the accuracy of the estimation, along with a higher computational burden and lower real-time performance [11].

Passing from instance-level to category-level, the problem can be formalized as follows: given a set of images I, a set of categories of objects sharing some common properties C = C1 ∪ ... ∪ CN, and a set of objects O belonging to such different categories, the objective is to find, for each RGB(-D) image Ij ∈ I and object instance Mi ∈ O, the mapping (Ij, Mi) ↦ (Ck, Rij, tij, sij), where Ck is the category which the object Mi belongs to, and sij ∈ R³ its size. It is worth noticing that the CAD model is not available for each object instance during inference, otherwise the problem ends up in the instance-level setting. For that reason, the sij term represents the size of the 3D bounding box tightly surrounding the object, to solve the scale ambiguity.

Such a scale factor is relevant for robotic grasping scenarios to determine whether the target object can fit the gripper opening. This family of problems leaves space for investigation, especially concerning the exploitation of RGB information. Indeed, state-of-the-art approaches like Shape-Prior [16] (or SPD) and FS-Net [17], instead of using only color images during training, extract a point cloud from the depth image of the observed object and apply a set of 3D deformations to augment the available data and catch intra-category salient features. Existing methods can also be classified as single-stage or multiple-stage estimators. In the former, the training process of both detection and pose estimation outputs is performed jointly, as in the case of the key-point-based approaches presented in [10] or [18]. Multiple-stage methods like [14] or [15], instead, can benefit from higher modularity, considering the vast improvement that 2D detectors have been going through.

A dual problem to 6D pose estimation is 3D model reconstruction. Explicit shape representation [19] is a class of approaches that approximates a surface as a function of 2D coordinates and enhances the granularity of this approximation by increasing the number of edges, triangles, or vertices at the expense of additional processing time. Variations in the objects' topology within the same category can hamper performances due to the possible presence of holes and gaps inside the 3D model, thus leaving space for the so-called implicit surface representations [20]. Signed Distance Functions (SDF) are the most widespread implicit model that computes the distance to the closest surface for each considered 3D point and assigns a positive (negative) sign if the point is inside (outside) [21]. In addition, a novel approach to 3D reconstruction is constituted by Neural Radiance Fields (NeRF) [8], which generate 3D scenes from a sequence of RGB images knowing the poses of the cameras. Then, by sampling 3D coordinates and 2D viewing directions for each camera ray, it is possible to feed a neural network and get an RGB-density image as an output to reconstruct the final 3D mesh. The main drawback of such an approach regards the high computational burden, both during the training and testing phases, in the order of many seconds or minutes [8]. Instant-NeRF [22] goes in this direction by reducing inference time by a factor of 10 to 100; however, NeRF-based methods remain more suitable for 3D mesh reconstruction when multiple views are available, an assumption not always true in fixed-camera robotics settings.
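Going back to the two problem formulations at the beginning of this section, the following minimal Python sketch contrasts the instance-level output (R, t) with the category-level output (category, R, t, s); the type and field names are illustrative only and are not part of any released i2c-net code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class InstanceLevelPose:
    # Known CAD model: only rotation and translation w.r.t. the camera frame.
    R: np.ndarray      # (3, 3) rotation matrix in SO(3)
    t: np.ndarray      # (3,) translation vector in meters

@dataclass
class CategoryLevelPose:
    # Unknown instance: the category label and the metric size of the tight
    # 3D bounding box are needed to resolve the scale ambiguity.
    category: str      # e.g. "bottle", "mug"
    R: np.ndarray      # (3, 3) rotation matrix
    t: np.ndarray      # (3,) translation vector in meters
    s: np.ndarray      # (3,) side lengths of the tight 3D bounding box
```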


Fig. 2: The presented method aims to find the 6D pose and 3D object size from a single RGB image, with the depth image used
to correct the estimation. It encompasses three neural network modules for 2D object detection (or segmentation), instance-level
6D pose estimation, and box-supervised 3D model reconstruction. Dashed components are used at inference time only.
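To make the data flow of Fig. 2 concrete, the sketch below composes the modules at inference time. All callables (detect_2d, reconstruct_mesh, estimate_pose, depth_correction) are hypothetical placeholders for the corresponding neural modules, not actual i2c-net APIs.

```python
def i2c_net_inference(rgb, depth, K, detect_2d, reconstruct_mesh,
                      estimate_pose, depth_correction):
    """Sketch of the i2c-net inference flow (Fig. 2); callables are placeholders."""
    results = []
    # 1) 2D object detection (or instance segmentation) on the full frame.
    for category, bbox in detect_2d(rgb):
        rgb_roi = crop(rgb, bbox)
        depth_roi = crop(depth, bbox)
        # 2) Box-supervised 3D model reconstruction from the RGB ROI
        #    (mesh with approximate absolute scale s_hat along each axis).
        mesh, s_hat = reconstruct_mesh(rgb_roi, category)
        # 3) Instance-level 6D pose estimation using the reconstructed mesh.
        R, t_hat = estimate_pose(rgb_roi, mesh, K)
        # 4) Depth-based post-processing: rescale translation and box size.
        t, s = depth_correction(mesh, R, t_hat, s_hat, depth_roi, K)
        results.append((category, R, t, s))
    return results

def crop(image, bbox):
    u0, v0, u1, v1 = bbox
    return image[v0:v1, u0:u1]
```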

III. METHODOLOGY

The proposed approach presents a framework for 6D category-level pose estimation starting from an instance-level network (fig. 2). The objective is to show that instance-level models that achieved high performance in 6D pose estimation, but suffer from low generalization capabilities, can still be effective for category-level tasks. The point is that, in general, RGB instance-level architectures encompass backbones that can have general-purpose applications like classification and detection. Therefore, they can distill the most salient features if exposed to various instances of the same category to output results in a latent space. From that, other neural networks can retrieve geometrical and shape information about the considered objects. The work investigates in the experimental campaign some categories that are common in the research community, such as banana, bottle, bowl, camera, can, laptop, and mug. This section gives an overview of the proposed approach and details each module of the pipeline, as well as the dataset generation procedure required for the training.

The design of the architecture is modular, allowing changing the components with the most recent research advances in computer vision as long as they respect the same required interface. In particular, the pipeline starts from a 2D object detection module that takes as input an RGB image and outputs the 2D bounding box of the target. Then, a 3D model reconstruction module exploits the cropped 2D RGB image to generate the object 3D mesh with absolute scale along the coordinate axes ŝ ∈ R³. Finally, the 6D instance-level estimation module combines the cropped 2D RGB image and the 3D mesh to obtain the 3D rotation matrix R and the 3D translation vector t̂. A depth-correction module can be used in post-processing to improve the estimates t̂ ∈ R³ and ŝ ∈ R³ through the cropped depth image of the considered target object, to obtain the final translation t and scale s.

Given the significant recent advances in computer vision in the field of 2D object detection, the proposed approach relies on an off-the-shelf object detector, i.e., YOLOv5 [6], that takes the full camera frame in input and provides both the class of a given object and its 2D bounding box to the following stages of the network. Cluttered scenes may cause a degradation in instance-level 6D pose estimation, and consequently, at inference time, it may be convenient to replace the 2D detection module with a 2D instance segmentation like Detectron [7]. This is useful to remove ambiguity in the single view by masking occluding objects and increasing the attention on the object of interest. The benefits of segmentation are opportunely analysed in the experimental campaign.

A. Photo-realistic dataset

A custom-made, purely photo-realistic dataset is generated to train the proposed framework. The dataset is built through BlenderProc, a tool based on the Blender graphic engine capable of rendering RGB images with an acceptable sim-to-real gap, together with 6D object pose annotations [9]. The constructed dataset contains 9900 images for each of the considered categories, subdivided into 300 scenes with 33 images each, to simulate various camera viewpoints, illumination, and background conditions. The target is to make the network generalize to objects whose CAD is not available within the same category. The method employs 15 base models for the camera object, 300 for laptop, and 100 for the other NOCS categories (bottle, bowl, can, mug) that are more related to manipulation tasks. Accordingly, ShapeNet [23] is a valuable source to gather the base CAD models. Taking inspiration from Shape-Prior [16] and FS-Net [17], such meshes are randomly enlarged or shrunk along their coordinate axes to augment the available dataset.
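As an illustration of this augmentation step, the sketch below applies an independent random scale factor to each coordinate axis of a base mesh given as a vertex array. The scale range and the base model file are hypothetical example values; texture randomization and photo-realistic rendering (handled by BlenderProc in the paper) are omitted.

```python
import numpy as np

def deform_base_model(vertices, rng, low=0.8, high=1.2):
    """Randomly enlarge or shrink a base CAD model along its coordinate axes.

    vertices: (N, 3) array of mesh vertices expressed in the object frame.
    low/high: tunable per-axis deformation range (example values, see Sec. III-A).
    """
    scale = rng.uniform(low, high, size=3)   # independent factor per axis
    deformed = vertices * scale              # anisotropic scaling
    return deformed, scale

rng = np.random.default_rng(0)
base = np.load("bottle_base_vertices.npy")   # hypothetical base-model vertex file
augmented = [deform_base_model(base, rng) for _ in range(100)]
```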



Fig. 3: Starting from a set of predefined CAD models [23] (a), a first randomization process (b) applies random textures to
each instance, a shape randomization process (c) further augments the 3D dataset along the coordinate axes, and finally, a
photo-realistic renderer (d) generates annotated RGB images.

The deformation is applied to each side of the bounding box within a given range that is a tunable parameter selected depending on the required needs. A large range may be useful to push for generalization over the same category, but it is worth not exaggerating, since a too broad range can lead to losing the main properties of the category's shape. The main innovation compared to other category-aware works is that, during training, the proposed method uses only synthetic RGB images through an aleatoric set of colored textures applied to the deformed 3D models, to increase generalization and reduce overfitting to particular surface patterns. In this way, the neural network model can focus more on the shape to reconstruct the geometry of the objects despite the intra-class variability. To this end, in industrial settings that work per category of object, the number of base models can possibly be reduced according to such diversity, also addressing new incoming products in line with the flexibility concept of Industry 4.0. Fig. 3 details the augmentation pipeline, where 3D CAD models can be endowed with some color and texture attributes picked at random. Such modified meshes are then passed to BlenderProc, in charge of randomly scaling the models. In addition, fig. 3 shows a qualitative representation of the shape deformation along the coordinate axes: the yellow and red wireframes outline the maximum and minimum deformation possible for a given category, compared to the base model depicted in black. Furthermore, BlenderProc places the objects in a photo-realistic environment, where a wide variety of backgrounds and illumination conditions are simulated. Finally, the software computes the 6D pose of each object with respect to each camera view. A strength of this method is the possibility to develop an in-house approach: all the data generation process is fully under the control of the user, and further developments are not constrained by the lack of access to real-world data annotation tools as in [24].

B. RGB 6D pose estimation

This module exploits an instance-level network to extract the objects' 6D pose (3D rotation R and 3D translation t̂ with respect to the camera frame) from an RGB input, the associated 3D model, and the intrinsic parameters K ∈ R³ˣ³ of the employed visual sensor.

Given the high number of instance-level networks, the choice of the particular model comes from a trade-off between accuracy, speed, and flexibility. In this work, the Geometry-Guided Direct Regression Network, or GDR-Net, is the reference baseline. This architecture is based on an encoder-decoder encompassing a ResNet [25] backbone and a custom decoder to reconstruct an internal representation of the geometry of the seen object's feature space. Such a capability is appealing for an extension to category-level scenarios: the idea is to make the network learn the 3D geometry common to a given class, instead of focusing on the details concerning a few instances of various categories, as in instance-aware settings. The model is a fully differentiable architecture enabling end-to-end training of the encoder-decoder and the Perspective-n-Point (PnP) modules in charge of finding the 6D pose of the 3D model given the features extracted from the RGB input [14]. Nonetheless, the recovery of the relative and, in particular, the absolute scale along the coordinate axes of the item from a single image is still beyond the possibilities of instance-level networks. To this end, the 3D reconstruction module becomes essential to carry out a successful category-aware estimation.

C. 3D model reconstruction

This neural model is trained over the same set of photo-realistic RGB images of the considered categories and learns how to infer the 3D mesh of unseen instances from a single viewpoint. In this work, a box-supervised 3D model reconstructor is developed on top of Multi-Category Mesh Reconstruction (MCMR) [26], as in fig. 4. In detail, the original architecture makes use of the weak perspective projection model, mostly suitable when objects have a similar distance along the camera axis [27]. Moreover, it encompasses a fully connected network to regress a 2D translation, a 3D rotation, and a 1D scaling factor. However, by delegating the pose estimation task to the instance-level network, it is possible to remove the above restriction and get the full 3D translation. Consequently, in order to retrieve the proper shape and scale of the 3D model, it is useful to exploit prior knowledge acquired during the learning phase and stored in 3D models called meanshapes.



Fig. 4: Box-supervised 3D model reconstruction: from the monocular synthetic RGB 6D pose estimation dataset, a set of meanshapes
is learned. The newly introduced 3D bounding box loss (in red) allows backpropagating information (in green) about the object’s
size to the other neural components of the model. Dashed elements are used at inference time only.

A meanshape can be regarded as a latent feature that condenses the salient information about the structure of the objects seen at training time. It is possible to learn multiple meanshapes and let a classifier select the most suitable one with respect to the image features. In the presented approach, the loss function in [26] is completed with the 3D bounding box supervision by introducing the term

L_3Dbbox = (s_x^pr − s_x^gt)² + (s_y^pr − s_y^gt)² + (s_z^pr − s_z^gt)²,

where s_k is the length of the side k ∈ {x, y, z} of the object's predicted (pr) and ground truth (gt) 3D bounding boxes. In such a way, the information is back-propagated to all the neural components, and in particular to the SDF-based network [21], to regress the proper scale of the object as well as its shape, with a good real-time performance. In order to perform a comparison between predicted and ground truth meshes, it is convenient to align such 3D models in terms of position. Therefore, at every training step, the predicted mesh is shifted so that its bounding box's centroid becomes the 3D point [0, 0, 0], as by the convention adopted in the NOCS dataset. In principle, the actual absolute scale s would be unknown; however, thanks to the 3D box supervision, its approximation ŝ is retrievable up to the range of dimensions considered in the dataset generation process, as shown in fig. 3.
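A minimal PyTorch sketch of this 3D bounding box term, including the centroid-alignment convention described above, could look as follows; the function and tensor names are illustrative and do not reproduce the MCMR code.

```python
import torch

def center_on_bbox_centroid(verts):
    # Shift the mesh so that its axis-aligned bounding box is centered at [0, 0, 0].
    centroid = 0.5 * (verts.max(dim=0).values + verts.min(dim=0).values)
    return verts - centroid

def bbox_3d_loss(pred_verts, gt_verts):
    pred = center_on_bbox_centroid(pred_verts)
    gt = center_on_bbox_centroid(gt_verts)
    # Side lengths of the tight (axis-aligned) 3D bounding boxes.
    s_pred = pred.max(dim=0).values - pred.min(dim=0).values
    s_gt = gt.max(dim=0).values - gt.min(dim=0).values
    # L_3Dbbox: sum over x, y, z of the squared side-length differences.
    return torch.sum((s_pred - s_gt) ** 2)

loss = bbox_3d_loss(torch.rand(500, 3), torch.rand(500, 3))
```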
D. Depth correction module

As detailed in fig. 2, the 3D CAD model is not accessible at inference time. Consequently, the instance-level 6D pose estimation network needs to be completed with the above-mentioned 3D reconstruction module, which is sufficient for a proper 3D rotation regression, given the correct relative scale of the 3D model. On the other hand, a further correction may be performed on the 3D translation to compensate for pose estimation errors due to the absolute scale. To this purpose, it is convenient to exploit a rendered depth map, obtained by rendering the predicted mesh coming from the 3D model reconstruction module with the predicted rotation and translation provided by the instance-level network. At this point, a correction based on stereo depth may be applied, as depicted in fig. 5 and outlined in the following steps:
• finding the width σ_u and height σ_v ratios between the rendered and the measured depth maps;
• sampling p 2D points {(u_1^r, v_1^r), ..., (u_p^r, v_p^r)} on the rendered depth map z_r; in the experiments, p = 8 proved to be enough for each estimation;
• finding the corresponding points on the measured depth map z_m as u_i^m = σ_u u_i^r and v_i^m = σ_v v_i^r;
• finding the final depth ratio as σ_z = (1/p) Σ_{i=1..p} z_m(u_i^m, v_i^m) / z_r(u_i^r, v_i^r), so that both the 3D bounding box and the translation vector can be properly scaled to t = σ_z t̂ and s = σ_z ŝ.

Fig. 5: Correction procedure of the pose estimation: the comparison between the rendered and the measured (sensor) depth maps can be used to properly rescale the translation coordinates.

Although the proposed method is not designed for heavily cluttered scenes, it can address self-occlusions, as well as mild occlusions by other objects: if points are present in the sensor depth map but not in the rendered depth map, another pair is sampled.
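The correction above can be summarized with a short numpy sketch. It is a simplification under stated assumptions: it samples pixels where the rendered depth is valid (at least p of them are assumed to exist) and simply skips points missing in the sensor map rather than re-sampling a new pair.

```python
import numpy as np

def depth_scale_correction(z_rendered, z_measured, t_hat, s_hat, p=8, rng=None):
    """Rescale translation t_hat and box size s_hat using rendered vs. sensor depth."""
    rng = rng or np.random.default_rng()
    # Width/height ratios between the rendered and the measured depth maps.
    sigma_v = z_measured.shape[0] / z_rendered.shape[0]
    sigma_u = z_measured.shape[1] / z_rendered.shape[1]
    # Sample p pixels where the rendered depth is valid (object visible).
    vs, us = np.nonzero(z_rendered > 0)
    idx = rng.choice(len(us), size=p, replace=False)
    ratios = []
    for u_r, v_r in zip(us[idx], vs[idx]):
        u_m, v_m = int(sigma_u * u_r), int(sigma_v * v_r)
        if z_measured[v_m, u_m] > 0:                 # skip missing sensor depth
            ratios.append(z_measured[v_m, u_m] / z_rendered[v_r, u_r])
    sigma_z = float(np.mean(ratios))                 # final depth ratio
    return sigma_z * t_hat, sigma_z * s_hat
```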


Fig. 6: Quantitative performance analysis over the different NOCS categories: (a) 3D IoU error, (b) rotation error, and (c) translation error. Each panel reports the Average Precision [%] against the corresponding threshold (percentage of 3D IoU, degrees, and centimeters) for the bottle, can, bowl, mug, laptop, and camera classes, together with their average.

The assumption behind this choice is the possibility to rely on a grasping policy that prioritizes less occluded objects to reduce the clutter for the next grasping. Another relevant feature of this approach is the reduced complexity compared to other correction methods like Iterative Closest Point (ICP) on 3D point clouds [28].

IV. EXPERIMENTAL RESULTS

To assess the performance of the presented method, the NOCS-REAL 275 [12] dataset is used as a widespread benchmark. The proposed architecture is trained on a custom-made photo-realistic dataset, which is not related to NOCS-REAL's test subset used for quantitative evaluation. In addition, the YCB-V [11] test set is used to show that the method can also work with categories beyond the NOCS dataset.

The following metrics are introduced to allow a comparison with other approaches:
• 3D Intersection over Union (3D IoU) measures the average precision of the ratio between the intersection and the union of the predicted and ground truth 3D bounding boxes [29];
• n°, n cm measures the average precision of the predictions with a roto-translational error below n degrees and n centimeters; a symmetry-aware version of the metric can relax the error in case of ambiguities along the axes of symmetry [29].

The reference framework for the experimental campaign is PyTorch. Training is carried out on an NVIDIA RTX 3090 GPU (24 GB), while inference runs on an RTX 3080 Laptop GPU (8 GB). For real-world experiments in the wild, the used device is an RGB-D Luxonis OAK-D camera [30].

Table I compares the i2c-net architecture to some state-of-the-art approaches over category-level metrics. The reported performance refers to the configuration exploiting both depth correction and instance segmentation at inference time. i2c-net registers the highest accuracy on 3D intersection over union at 25% (3D25) and 50% (3D50). Concerning the n°, n cm metric, i2c-net is the best on 10°, 10 cm. It is worth noticing that the majority of the other works rely on real-world annotations during training, which hinders a proper extension to categories not included in the NOCS dataset. Conversely, recent methods like CPPF [32] exploit synthetic data only in the learning phase. However, compared to it, i2c-net outlines superior performances on all the reported measurements, thus highlighting that the presented approach is quantitatively effective in a real-world scenario. It must be noted that all the category-level state-of-the-art approaches reported in Table I rely on depth information.

Method            Ann. type   3D25    3D50    5°, 5 cm   10°, 5 cm   10°, 10 cm
NOCS [12]         R+S         84.9    80.5    10.0       25.2        28.8
SPD [16]          R+S         83.4    77.3    21.4       54.1        56.2
FS-Net [17]       R+S         95.1    92.2    28.2       60.8        64.6
Gao et al. [31]   S           68.6    27.7    7.8        17.1        -
CPPF [32]         S           78.2    26.4    16.9       44.9        46.0
i2c-net           S           99.96   92.46   24.62      49.42       67.05

TABLE I: Performance comparison on the NOCS REAL-275 dataset [12] between i2c-net and various state-of-the-art approaches, subdivided into methods requiring real (R) annotations (ann.) or just synthetic (S) ones.

Figure 6 contains a quantitative performance breakdown over the different categories in terms of average precision (AP) for 3D Intersection over Union, rotation error, and translation error. Different thresholds are used to extract the curves for each metric; the closer the AP value is to 100%, the better. Figure 7, on the other hand, shows qualitative results related to the analysis.

Translation errors show consistent performances among all the classes except for bowl, where the 3D reconstruction step may face difficulties in detecting the depth of the hollow from RGB images only. Further pose randomization techniques can be applied to reduce such ambiguities. Conversely, bowl, as well as bottle and can, show results above the average regarding the rotation error, which does not penalize different predictions along the symmetry axes, as these do not affect grasping and manipulation tasks. In addition, symmetric objects benefit from the fewer viewpoints required for a proper 3D shape reconstruction; therefore, increasing that number during training for other categories can provide an improvement. Concerning the camera category, results show lower performances on the rotation error, due to a shape variability that is higher than in the other categories. Conversely, the class laptop behaves properly on n°, n cm, while it lags on 3D IoU, since the articulated structure of the object may lead to a low overlap between the reconstructed and ground truth 3D models, in spite of a good pose estimation.
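As a reference for the n°, n cm metric used throughout this section, a minimal sketch of the per-prediction test (without the symmetry-aware relaxation) could look as follows; function and variable names are illustrative.

```python
import numpy as np

def pose_within_threshold(R_pred, t_pred, R_gt, t_gt, n_deg, n_cm):
    """True if the rotation error is below n_deg degrees and the translation
    error is below n_cm centimeters (translations assumed to be in meters)."""
    # Geodesic rotation error: angle of R_pred * R_gt^T.
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err_cm = 100.0 * np.linalg.norm(t_pred - t_gt)
    return rot_err_deg < n_deg and trans_err_cm < n_cm

# Example: a prediction counted as correct under the 10°, 10 cm threshold.
ok = pose_within_threshold(np.eye(3), np.array([0.0, 0.0, 0.5]),
                           np.eye(3), np.array([0.0, 0.02, 0.48]), 10, 10)
```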


(a) NOCS REAL-275 (b) YCB-V (c) Real-world images in the wild
Fig. 7: Qualitative results on some images from NOCS REAL-275 (first column) and YCB-V (second column) test sets, and
real-world examples in the wild (third column). All the involved instances are not included in the training dataset.

Row   3D CAD available   Depth correct.   2D seg   5°, 5 cm   10°, 5 cm   10°, 10 cm
1     ✓                  ✓                ✓        44.93      71.40       75.25
2     ✓                  ✓                ✗        36.02      62.97       66.52
3     ✓                  ✗                ✓        37.48      56.37       71.75
4     ✓                  ✗                ✗        23.32      39.65       56.16
5     ✗                  ✓                ✓        24.62      49.42       67.05
6     ✗                  ✓                ✗        22.33      43.65       60.70
7     ✗                  ✗                ✓        8.84       14.33       30.45
8     ✗                  ✗                ✗        10.41      18.03       31.66

TABLE II: Ablation study on the impact of removing the depth correction and the instance segmentation on the n°, n cm metric, tested on NOCS REAL, along with how much the method is capable to compensate for the absence of the 3D CAD model.

Table II reports an ablation study to highlight how much the performance of the presented architecture, expressed through the n°, n cm metric, degrades by removing different modules. Such an analysis covers the depth correction and the instance segmentation used to feed the network with a masked RGB input. In addition, the first 4 rows of the table consider the case in which the ground truth 3D model is available, to give a baseline for the last 4 rows and quantitatively assess how much the presented method can compensate for the lack of a 3D model by reconstructing it at inference time.

By comparing rows 6 and 8, and rows 5 and 7, not using the depth correction module impacts between 11.92% and 36.6% in case the 3D model is not available. Similarly, instance segmentation, which can be convenient to counteract occlusions, provides a lighter improvement, between 2.29% and 16.72%, by comparing rows 1-3-5 against 2-4-6, respectively. It is worth noticing that row 5 presents the best performance in a category-level condition, i.e., when depth correction and 2D segmentation are both used while the 3D CAD is not available, as already presented in Table I.

On the other hand, comparing rows 1-2 versus 5-6, respectively, highlights that once the 3D model needs to be reconstructed, the performance drops between 8.2% and 21.98% when depth is available, increasing up to 42.04% when i2c-net cannot access any 3D information, as in rows 7-8 versus rows 3-4, where, instead, at least the 3D model is available. Although this difference is not negligible, the generalization gain obtained through 3D mesh reconstruction is consistent, since the 3D model of the seen object is not available at inference time.

As depicted in fig. 1, GDR-Net provides unsuccessful results with instances in the wild, confirming the superiority of i2c-net over its instance-level baseline in the presence of unseen instances not contained in the dataset for the considered categories (mug, bottle, and bowl).

A. Real-world experiments

Fig. 7 highlights the capabilities of the network to generalize beyond the instances seen during training, coherently with the quantitative analysis. Since no ground truth is available, the oriented 3D bounding box shows qualitative results over different categories, thanks to the intensive domain, texture, and shape randomization carried out prior to the training phase. In addition, to show the extension of the pipeline to categories not included in NOCS REAL-275, testing on the YCB-V object banana is reported in fig. 7b, beside the classes in common with the former dataset.

Real-time experiments show an inference time of 60.3 milliseconds (ms) on the RTX 3080 Laptop GPU for each pose estimation, averaging over 1000 evaluations.


A more detailed breakdown highlights 12.8 ms for the 2D object detection with YOLOv5 small [6] and 27.7 ms for the instance-level pose estimation. The remaining time is split between 3D reconstruction (7.4 ms) and depth correction (12.4 ms) that, together, introduce a 50% overhead. However, even though this represents the price for moving from an instance-level to a more general category-level pipeline, real-time tasks can still be carried out without further optimization.

V. CONCLUSIONS

Two of the primary tasks for robots in many applications are object grasping and manipulation. Object detection and pose estimation are fundamental skills to give robots the ability to adapt to the changing environment and grasp a variety of objects that present different sizes, material properties, and texture appearances in cluttered scenes with different lighting conditions.

This work presents i2c-net to extract the 6D pose of multiple objects belonging to different categories, starting from an instance-level pose estimation network and exploiting a custom-made synthetic dataset for training. Such a dataset uses some base CAD models for known objects, encompassing the diverse shapes of the objects in that category, as a starting point for a deformation procedure, which provides new object models enriched with real textures for domain randomization purposes. The selected instance-level pose estimation network can be trained on such augmented photo-realistic images, and, at inference time, it is used in combination with a 3D mesh reconstruction network, achieving category-level capabilities. Depth information is used for post-processing as a correction. Tests conducted on real objects of the YCB-V and NOCS REAL-275 datasets show the high accuracy of the proposed method as well as good real-time performance.

REFERENCES

[1] S. D'Avella, C. A. Avizzano, and P. Tripicchio, "ROS-Industrial based robotic cell for Industry 4.0: Eye-in-hand stereo camera and visual servoing for flexible, fast, and accurate picking and hooking in the production line," Robotics and Computer-Integrated Manufacturing, vol. 80, p. 102453, 2023.
[2] A. Ajoudani, A. M. Zanchettin, S. Ivaldi, A. Albu-Schäffer, K. Kosuge, and O. Khatib, "Progress and prospects of the human–robot collaboration," Autonomous Robots, vol. 42, no. 5, pp. 957–975, 2018.
[3] S. D'Avella, P. Tripicchio, and C. A. Avizzano, "A study on picking objects in cluttered environments: Exploiting depth features for a custom low-cost universal jamming gripper," Robotics and Computer-Integrated Manufacturing, vol. 63, p. 101888, 2020.
[4] G. Du, K. Wang, S. Lian, and K. Zhao, "Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review," Artificial Intelligence Review, vol. 54, no. 3, pp. 1677–1734, 2021.
[5] X. Wu, D. Sahoo, and S. C. Hoi, "Recent advances in deep learning for object detection," Neurocomputing, vol. 396, pp. 39–64, 2020.
[6] G. Jocher, A. Chaurasia, A. Stoken, J. Borovec et al., "ultralytics/yolov5: v6.1 - TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference," Feb. 2022.
[7] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, "Detectron2," https://github.com/facebookresearch/detectron2, 2019.
[8] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "NeRF: Representing scenes as neural radiance fields for view synthesis," in ECCV, 2020.
[9] M. Denninger, M. Sundermeyer, D. Winkelbauer, Y. Zidan, D. Olefir, M. Elbadrawy, A. Lodhi, and H. Katam, "BlenderProc," arXiv preprint arXiv:1911.01911, 2019.
[10] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, "Deep object pose estimation for semantic robotic grasping of household objects," in Conference on Robot Learning (CoRL), 2018.
[11] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, "PoseCNN: A convolutional neural network for 6D object pose estimation in cluttered scenes," in Robotics: Science and Systems (RSS), 2018.
[12] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, "Normalized object coordinate space for category-level 6D object pose and size estimation," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2637–2646.
[13] B. Tekin, S. N. Sinha, and P. Fua, "Real-time seamless single shot 6D object pose prediction," in CVPR, 2018.
[14] G. Wang, F. Manhardt, F. Tombari, and X. Ji, "GDR-Net: Geometry-guided direct regression network for monocular 6D object pose estimation," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16606–16616.
[15] Y. Di, F. Manhardt, G. Wang, X. Ji, N. Navab, and F. Tombari, "SO-Pose: Exploiting self-occlusion for direct 6D pose estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 12396–12405.
[16] M. Tian, M. H. Ang Jr, and G. H. Lee, "Shape prior deformation for categorical 6D object pose and size estimation," in Proceedings of the European Conference on Computer Vision (ECCV), August 2020.
[17] W. Chen, X. Jia, H. J. Chang, J. Duan, L. Shen, and A. Leonardis, "FS-Net: Fast shape-based network for category-level 6D object pose estimation with decoupled rotation mechanism," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 1581–1590.
[18] Y. Lin, J. Tremblay, S. Tyree, P. A. Vela, and S. Birchfield, "Single-stage keypoint-based category-level object pose estimation from an RGB image," in IEEE International Conference on Robotics and Automation (ICRA), 2022.
[19] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang, "Pixel2Mesh: Generating 3D mesh models from single RGB images," in ECCV, 2018.
[20] A. Tewari, J. Thies, B. Mildenhall, P. Srinivasan, Tretschk et al., "Advances in neural rendering," Computer Graphics Forum, vol. 41, no. 2, pp. 703–735, 2022.
[21] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, "DeepSDF: Learning continuous signed distance functions for shape representation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[22] T. Müller, A. Evans, C. Schied, and A. Keller, "Instant neural graphics primitives with a multiresolution hash encoding," ACM Trans. Graph., vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022.
[23] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang et al., "ShapeNet: An information-rich 3D model repository," ArXiv, vol. abs/1512.03012, 2015.
[24] A. Ahmadyan, L. Zhang, A. Ablavatski, J. Wei, and M. Grundmann, "Objectron: A large scale dataset of object-centric videos in the wild with pose annotations," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778, 2016.
[26] A. Simoni, S. Pini, R. Vezzani, and R. Cucchiara, "Multi-category mesh reconstruction from image collections," in 2021 International Conference on 3D Vision (3DV). IEEE, 2021, pp. 1321–1330.
[27] Z. Zhang, Weak Perspective Projection. Boston, MA: Springer US, 2014, pp. 877–883.
[28] P. Besl and N. McKay, "A method for registration of 3-D shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, pp. 239–256, 1992.
[29] J. Z. Fan, Y. Zhu, Y. He, Q. Sun, H. Liu, and J. He, "Deep learning on monocular object pose detection and tracking: A comprehensive overview," ACM Computing Surveys (CSUR), 2022.
[30] Luxonis, "OAK-D, DepthAI documentation," 2022, [Online; accessed 30-August-2022]. Available: https://shop.luxonis.com/products/oak-d
[31] G. Gao, M. Lauri, Y. Wang, X. Hu, J. Zhang, and S. Frintrop, "6D object pose regression via supervised learning on point clouds," in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 3643–3649.
[32] Y. You, R. Shi, W. Wang, and C. Lu, "CPPF: Towards robust category-level 9D pose estimation in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
