possible from RGB images and available CAD models and use depth information for post-processing as a correction. Considering some known categories, like, for example, bottles or cans, a few base CAD models, i.e., CAD models for known objects that encompass the diverse shapes of the objects in that category, are used as a starting point for a deformation procedure. Such deformations generate new object models that are enriched with real textures for domain randomization purposes. The instance-level pose estimation network can be trained on such augmented photo-realistic images, and at inference time it is used in combination with a 3D mesh reconstruction network, achieving category-level capabilities. Several tests are conducted on real objects of the YCB-V [11] and NOCS-REAL 275 [12] datasets to assess the performance of the proposed framework. Figure 1 shows a qualitative comparison between the proposed framework and the original GDR-Net, which has demonstrated promising performance as an instance-level pose estimation network.

The rest of the paper is organized as follows: Section II describes the considered problem, related works, and open challenges; Section III details the structure of the presented pipeline, from a general viewpoint to a fine-grained description of the neural network models employed; Section IV collects experimental results with qualitative and quantitative comparisons with state-of-the-art approaches; and Section V is devoted to conclusions by summarizing the contribution of the work.

II. PROBLEM DEFINITION AND RELATED WORK

One of the key aspects of autonomous and reliable grasping is the pose estimation of the target object. Existing approaches that extract the 6D pose information from a single RGB(-Depth) image can be mainly distinguished into instance-level and category-level methods. The first type is biased by the CAD model and texture properties of the object used during training and does not work properly if changes are applied to such an object. The instance-level pose estimation problem can be formalized as follows: given a set of images I and a set of objects O for which the CAD model is available, the objective is to find, for each RGB(-D) image I_j ∈ I and object instance M_i ∈ O, the mapping (I_j, M_i) ↦ (R_ij, t_ij), where R_ij ∈ SO(3) is the object rotation matrix and t_ij ∈ R^3 is the 3D translation vector for the particular object instance with respect to the camera frame. The presence of the 3D CAD model resolves the ill-posedness of the problem of extracting 3D information from a single color image, but jeopardizes the generalization capabilities of the method. Indeed, instance-level approaches may have practical applications in scenarios with a fixed number of objects, while their effectiveness drops with unseen instances. In the latest years, this type of network has witnessed an impressive improvement concerning accuracy and speed: from pioneering works such as YOLO-6D [13] and Dope [10] to more recent approaches such as GDR-Net [14] and SO-Pose [15]. Depth information can increase the accuracy of the estimation, at the cost of a higher computational burden and lower real-time performance [11].

Passing from instance-level to category-level, the problem can be formalized as follows: given a set of images I, a set of categories of objects sharing some common properties C = ∪_{k=1}^{N} C_k, and a set of objects O belonging to such different categories, the objective is to find, for each RGB(-D) image I_j ∈ I and object instance M_i ∈ O, the mapping (I_j, M_i) ↦ (C_k, R_ij, t_ij, s_ij), where C_k is the category which the object M_i belongs to, and s_ij ∈ R^3 its size. It is worth noticing that the CAD model is not available for each object instance during inference, otherwise the problem ends up in the instance-level setting. For that reason, the s_ij term represents the size of the 3D bounding box tightly surrounding the object, introduced to solve the scale ambiguity.

Such a scale factor is relevant for robotic grasping scenarios, e.g., to determine whether the target object can fit the gripper opening. This family of problems leaves space for investigation, especially concerning the exploitation of RGB information. Indeed, state-of-the-art approaches like Shape-Prior [16] (or SPD) and FS-Net [17], instead of using only color images during training, extract a point cloud from the depth image of the observed object and apply a set of 3D deformations to augment the available data and catch intra-category salient features. Existing methods can also be classified as single-stage or multiple-stage estimators. In the former, the training process of both detection and pose estimation outputs is performed jointly, as in the case of the key-point-based approaches presented in [10] or [18]. Multiple-stage methods like [14] or [15], instead, can benefit from higher modularity, considering the vast improvement that 2D detectors have been going through.

A dual problem to 6D pose estimation is 3D model reconstruction. Explicit shape representation [19] is a class of approaches that approximates a surface as a function of 2D coordinates and enhances the granularity of this approximation by increasing the number of edges, triangles, or vertices at the expense of additional processing time. Variations in the objects' topology within the same category can hamper performance due to the possible presence of holes and gaps inside the 3D model, thus leaving space for the so-called implicit surface representations [20]. Signed Distance Functions (SDF) are the most widespread implicit model: they compute the distance to the closest surface for each considered 3D point and assign a negative (positive) sign if the point is inside (outside) the surface [21]. In addition, a novel approach to 3D reconstruction is constituted by Neural Radiance Fields (NeRF) [8], which generate 3D scenes from a sequence of RGB images with known camera poses. Then, by sampling 3D coordinates and 2D viewing directions for each camera ray, it is possible to feed a neural network and get a color and density output used to reconstruct the final 3D mesh. The main drawback of such an approach is the high computational burden, both during the training and testing phases, in the order of many seconds or minutes [8]. Instant-NeRF [22] goes in this direction by reducing inference time by a factor of 10 to 100; however, NeRF-based methods remain more suitable for 3D mesh reconstruction when multiple views are available, an assumption not always true in fixed-camera robotics settings.
Fig. 2: The presented method aims to find the 6D pose and 3D object size from a single RGB image, with the depth image used
to correct the estimation. It encompasses three neural network modules for 2D object detection (or segmentation), instance-level
6D pose estimation, and box-supervised 3D model reconstruction. Dashed components are used at inference time only.
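To make the data flow of Fig. 2 concrete, a minimal sketch of how the three modules could be chained at inference time is reported below; every function and attribute name (detector, pose_net, mesh_net, correct_scale, det.category, mesh.extents) is an illustrative placeholder and not the actual i2c-net implementation.

# Minimal sketch of the inference flow of Fig. 2 (all names are hypothetical
# placeholders for the three neural modules, not the actual i2c-net API).
def infer_category_level(rgb, depth, K, detector, pose_net, mesh_net, correct_scale):
    """Return a list of (category, R, t, s) tuples, one per detected object."""
    results = []
    for det in detector(rgb):                      # 2D detection (or segmentation)
        roi = rgb[det.v0:det.v1, det.u0:det.u1]    # RGB ROI of the detected object
        R, t_hat = pose_net(roi, det.category, K)  # instance-level 6D pose (Sec. III-B)
        mesh = mesh_net(roi, det.category)         # box-supervised reconstruction (Sec. III-C)
        s_hat = mesh.extents                       # 3D bounding-box size of the mesh
        sigma_z = correct_scale(mesh, R, t_hat, depth, K)  # depth correction (Sec. III-D)
        results.append((det.category, R, sigma_z * t_hat, sigma_z * s_hat))
    return results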
Fig. 3: Starting from a set of predefined CAD models [23] (a), a first randomization process (b) applies random textures to
each instance, a shape randomization process (c) further augments the 3D dataset along the coordinate axes, and finally, a
photo-realistic renderer (d) generates annotated RGB images.
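As a rough illustration of the per-axis shape randomization of step (c), the snippet below scales a mesh independently along x, y, and z within a tunable range; the range values, the trimesh-based loading, and the file path are illustrative assumptions, not the exact parameters of the proposed pipeline.

# Illustrative per-axis shape randomization (step (c) in Fig. 3).
# The scale range below is an assumption for illustration only.
import numpy as np
import trimesh

def randomize_shape(mesh: trimesh.Trimesh, low=0.8, high=1.2, rng=None):
    """Return a copy of `mesh` scaled by an independent random factor per axis."""
    rng = rng or np.random.default_rng()
    scale_xyz = rng.uniform(low, high, size=3)   # one factor per coordinate axis
    deformed = mesh.copy()
    deformed.vertices = deformed.vertices * scale_xyz
    return deformed

base = trimesh.load("base_models/bottle.obj")    # hypothetical base CAD model path
augmented = [randomize_shape(base) for _ in range(100)]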
The deformation is applied to each side of the bounding box within a given range that is a tunable parameter selected depending on the required needs. A large range may be useful to push for generalization over the same category, but it is worth not exaggerating, since a too broad range can lead to losing the main properties of the category's shape. The main innovation compared to other category-aware works is that, during training, the proposed method uses only synthetic RGB images through an aleatoric set of colored textures applied to the deformed 3D models to increase generalization and reduce overfitting to particular surface patterns. In this way, the neural network model can focus more on the shape to reconstruct the geometry of the objects despite the intra-class variability. To this end, in industrial settings that work per category of object, the number of base models can possibly be reduced according to such diversity, also addressing new incoming products in line with the flexibility concept of Industry 4.0. Fig. 3 details the augmentation pipeline, where 3D CAD models can be endowed with some color and texture attributes picked at random. Such modified meshes are then passed to Blenderproc, in charge of randomly scaling the models. In addition, Fig. 3 shows a qualitative representation of the shape deformation along the coordinate axes: the yellow and red wireframes outline the maximum and minimum deformation possible for a given category, compared to the base model depicted in black. Furthermore, Blenderproc places the objects in a photo-realistic environment, where a wide variety of backgrounds and illumination conditions are simulated. Finally, the software computes the 6D pose of each object with respect to each camera view. A strength of this method is the possibility to develop an in-house approach: all the data generation process is fully under the control of the user, and further developments are not constrained by the lack of access to real-world data annotation tools as in [24].

B. RGB 6D pose estimation

This module exploits an instance-level network to extract objects' 6D pose (3D rotation R and 3D translation t̂ with respect to the camera frame) from an RGB input, the associated 3D model, and the intrinsic parameters K ∈ R^{3×3} of the employed visual sensor.

Given the high number of instance-level networks, the choice of the particular model comes from a trade-off between accuracy, speed, and flexibility. In this work, the Geometry-Guided Direct Regression Network, or GDR-Net, is the reference baseline. This architecture is based on an encoder-decoder encompassing a ResNet [25] backbone and a custom decoder to reconstruct an internal representation of the geometry of the seen object's feature space. Such a capability is appealing for an extension to category-level scenarios: the idea is to make the network learn the 3D geometry common to a given class instead of focusing on the details concerning a few instances of various categories, as in instance-aware settings. The model is a fully differentiable architecture enabling end-to-end training of the encoder-decoder and the Perspective-n-Point (PnP) modules in charge of finding the 6D pose of the 3D model given the features extracted from the RGB input [14]. Nonetheless, the recovery of the relative and, in particular, the absolute scale along the coordinate axes of the item from a single image is still beyond the possibilities of instance-level networks. To this end, the 3D reconstruction module becomes essential to carry out a successful category-aware estimation.
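For intuition, the geometric core of a PnP step can be sketched with OpenCV's classical (non-differentiable) solver, given 2D-3D correspondences; this is only an illustration of the principle, not the learned, end-to-end trainable PnP module of GDR-Net [14].

# Illustration of the Perspective-n-Point principle with OpenCV (not the
# differentiable PnP module of GDR-Net): recover R, t from 2D-3D matches.
import cv2
import numpy as np

def pose_from_correspondences(pts_3d, pts_2d, K):
    """pts_3d: (N, 3) model coordinates, pts_2d: (N, 2) pixel coordinates."""
    ok, rvec, tvec = cv2.solvePnP(
        pts_3d.astype(np.float32),
        pts_2d.astype(np.float32),
        K.astype(np.float32),
        None,                           # assume undistorted images
        flags=cv2.SOLVEPNP_EPNP,
    )
    R, _ = cv2.Rodrigues(rvec)          # rotation vector -> 3x3 matrix
    return R, tvec.reshape(3)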
Fig. 4: Box-supervised 3D model reconstruction: from a monocular synthetic RGB 6D pose estimation dataset, a set of meanshapes is learned. The newly introduced 3D bounding box loss (in red) allows backpropagating information (in green) about the object's size to the other neural components of the model. Dashed elements are used at inference time only.
C. 3D model reconstruction

This neural model is trained over the same set of photo-realistic RGB images of the considered categories and learns how to infer the 3D mesh of unseen instances from a single viewpoint. In this work, a box-supervised 3D model reconstructor is developed on top of Multi-Category Mesh Reconstruction (MCMR) [26], as in Fig. 4. In detail, the original architecture makes use of the weak perspective projection model, mostly suitable when objects have a similar distance along the camera axis [27]. Moreover, it encompasses a fully connected network to regress a 2D translation, a 3D rotation, and a 1D scaling factor. However, by delegating the pose estimation task to the instance-level network, it is possible to remove the above restriction and get the full 3D translation. Consequently, in order to retrieve the proper shape and scale of the 3D model, it is useful to exploit prior knowledge acquired during the learning phase and stored in 3D models called meanshapes. A meanshape can be regarded as a latent feature that condenses the salient information about the structure of the objects seen at training time. It is possible to learn multiple meanshapes and let a classifier select the most suitable one with respect to the image features. In the presented approach, the loss function in [26] is completed with the 3D bounding box supervision by introducing the term

L_3Dbbox = (s_x^pr − s_x^gt)^2 + (s_y^pr − s_y^gt)^2 + (s_z^pr − s_z^gt)^2,

where s_k is the length of the side k ∈ {x, y, z} of the object's predicted (pr) and ground truth (gt) 3D bounding boxes. In such a way, the information is back-propagated to all the neural components, and in particular to the SDF-based network [21], to regress the proper scale of the object as well as its shape, with good real-time performance. In order to perform a comparison between predicted and ground truth meshes, it is convenient to align such 3D models in terms of position. Therefore, at every training step, the predicted mesh is shifted so that its bounding box's centroid becomes the 3D point [0, 0, 0], following the convention adopted in the NOCS dataset. In principle, the actual absolute scale s would be unknown; however, thanks to the 3D box supervision, its approximation ŝ is retrievable up to the range of dimensions considered in the dataset generation process, as shown in Fig. 3.
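A minimal PyTorch sketch of this side-length supervision, assuming batched predicted vertices and ground-truth box sides, is reported below; tensor shapes and function names are illustrative and do not reproduce the exact training code.

# Sketch of the 3D bounding-box side-length loss described above (PyTorch).
# Shapes and names are illustrative; vertices are (B, N, 3) batched meshes.
import torch

def bbox_sides(vertices: torch.Tensor) -> torch.Tensor:
    """Axis-aligned 3D bounding-box side lengths (B, 3) of the meshes."""
    return vertices.max(dim=1).values - vertices.min(dim=1).values

def center_on_bbox_centroid(vertices: torch.Tensor) -> torch.Tensor:
    """Shift each mesh so that its bounding-box centroid is at [0, 0, 0]."""
    centroid = 0.5 * (vertices.max(dim=1, keepdim=True).values
                      + vertices.min(dim=1, keepdim=True).values)
    return vertices - centroid

def loss_3d_bbox(pred_vertices: torch.Tensor, gt_sides: torch.Tensor) -> torch.Tensor:
    """L_3Dbbox = sum over x, y, z of the squared side-length error."""
    pred_sides = bbox_sides(center_on_bbox_centroid(pred_vertices))
    return ((pred_sides - gt_sides) ** 2).sum(dim=-1).mean()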
D. Depth correction module

As detailed in Fig. 2, the 3D CAD model is not accessible at inference time. Consequently, the instance-level 6D pose estimation network needs to be completed with the above-mentioned 3D reconstruction module, which is sufficient for a proper 3D rotation regression, given the correct relative scale of the 3D model. On the other hand, a further correction may be performed on the 3D translation to compensate for pose estimation errors due to the absolute scale. To this purpose, it is convenient to exploit a rendered depth map obtained by rendering the predicted mesh, which comes from the 3D model reconstruction module, using the predicted rotation and translation provided by the instance-level network. At this point, a correction based on the stereo depth may be applied, as depicted in Fig. 5 and outlined in the following steps:
• finding the width σ_u and height σ_v ratios between the rendered and measured depth maps;
• sampling p 2D points {(u_1^r, v_1^r), ..., (u_p^r, v_p^r)} on the rendered depth map z_r; in the experiments, p = 8 proves to be enough for each estimation;
• finding the corresponding points on the measured depth map z_m as u_i^m = σ_u u_i^r and v_i^m = σ_v v_i^r;
• finding the final depth ratio as σ_z = (1/p) Σ_{i=1}^{p} z_m(u_i^m, v_i^m) / z_r(u_i^r, v_i^r), so that both the 3D bounding box and the translation vector can be properly scaled to t = σ_z t̂ and s = σ_z ŝ.

Fig. 5: Correction procedure of the pose estimation: the comparison between rendered and measured (sensor) depth maps can be used to properly rescale the translation coordinates.
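The steps above can be condensed into the following NumPy sketch; the region-of-interest depth maps, the validity check on missing depth values, and the variable names are illustrative assumptions rather than the exact implementation.

# NumPy sketch of the depth-based scale correction (Sec. III-D).
# z_rendered / z_measured are ROI depth maps; p = 8 follows the text.
import numpy as np

def depth_scale_correction(z_rendered, z_measured, t_hat, s_hat, p=8, rng=None):
    rng = rng or np.random.default_rng()
    # width/height ratios mapping rendered pixel coordinates to measured ones
    sigma_v = z_measured.shape[0] / z_rendered.shape[0]
    sigma_u = z_measured.shape[1] / z_rendered.shape[1]
    ratios = []
    while len(ratios) < p:
        vr = rng.integers(0, z_rendered.shape[0])
        ur = rng.integers(0, z_rendered.shape[1])
        vm, um = int(sigma_v * vr), int(sigma_u * ur)
        zr, zm = z_rendered[vr, ur], z_measured[vm, um]
        if zr > 0 and zm > 0:           # resample if either depth value is missing
            ratios.append(zm / zr)
    sigma_z = float(np.mean(ratios))    # final depth ratio
    return sigma_z * t_hat, sigma_z * s_hat   # corrected translation and size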
[Fig. 6 plots: per-category average precision (AP) curves versus the 3D IoU threshold (percentage), rotation error (degrees), and translation error (centimeters); the legend includes, among others, laptop, camera, and the category average.]
Although the proposed method is not designed for heavily cluttered scenes, it can address self-occlusions, as well as mild occlusions where the 2D bounding box's size is not affected by other objects, since if points are present in the sensor depth map but not in the rendered depth map, then another pair is sampled. The assumption behind this choice is the possibility to rely on a grasping policy that prioritizes less occluded objects, so as to reduce the clutter for the next grasp. Another relevant feature of this approach is the reduced complexity compared to other correction methods, like Iterative Closest Point (ICP) on 3D point clouds [28].

IV. EXPERIMENTAL RESULTS

To assess the performance of the presented method, the NOCS-REAL 275 [12] dataset is used as a widespread benchmark. The proposed architecture is trained on a custom-made photo-realistic dataset, which is not related to NOCS-REAL's test subset used for quantitative evaluation. In addition, the YCB-V [11] test set is used to show that the method can also work with categories beyond the NOCS dataset.

The following metrics are introduced to allow a comparison with other approaches (a minimal check of the n°, n cm criterion is sketched after the per-category analysis below):
• 3D Intersection over Union (3D IoU) measures the average precision of the ratio between intersection and union of the predicted and ground truth 3D bounding boxes [29];
• n°, n cm is the average precision of the predictions with a roto-translational error below n degrees and n centimeters; a symmetry-aware version of the metric can relax the error in case of ambiguities along the axes of symmetry [29].

The reference framework for the experimental campaign is PyTorch. Training is carried out on an NVIDIA RTX 3090 GPU (24 GB), while inference runs on an RTX 3080 Laptop GPU (8 GB). For real-world experiments in the wild, the used device is an RGB-D Luxonis OAK-D camera [30].

Method          | Ann. type | 3D25  | 3D50  | 5°, 5 cm | 10°, 5 cm | 10°, 10 cm
NOCS [12]       | R+S       | 84.9  | 80.5  | 10.0     | 25.2      | 28.8
SPD [16]        | R+S       | 83.4  | 77.3  | 21.4     | 54.1      | 56.2
FS-Net [17]     | R+S       | 95.1  | 92.2  | 28.2     | 60.8      | 64.6
Gao et al. [31] | S         | 68.6  | 27.7  | 7.8      | 17.1      | -
CPPF [32]       | S         | 78.2  | 26.4  | 16.9     | 44.9      | 46.0
i2c-net         | S         | 99.96 | 92.46 | 24.62    | 49.42     | 67.05

TABLE I: Performance comparison on the NOCS REAL-275 dataset [12] between i2c-net and various state-of-the-art approaches, subdivided into methods requiring real (R) annotations (ann.) or just synthetic (S) ones.

Table I shows the i2c-net architecture compared to some state-of-the-art approaches, over category-level metrics. The reported performance refers to the configuration exploiting both depth correction and instance segmentation at inference time. i2c-net registers the highest accuracy on 3D intersection over union at 25% (3D25) and 50% (3D50). Concerning the n°, n cm metric, i2c-net is the best on 10°, 10 cm. It is worth noticing that the majority of the other works rely on real-world annotations during training, which hinders a proper extension to categories not included in the NOCS dataset. Conversely, recent methods like CPPF [32] exploit synthetic data only in the learning phase. However, compared to it, i2c-net shows superior performance on all the reported measurements, thus highlighting that the presented approach is quantitatively effective in a real-world scenario. It must be noted that all the category-level state-of-the-art approaches reported in Table I rely on depth information.

Figure 6 contains a quantitative performance breakdown over different categories in terms of average precision (AP) for 3D Intersection over Union, rotation error, and translation error. Different thresholds are used to extract the curves for each metric, where the closer the AP value is to 100%, the better. On the other hand, Figure 7 shows qualitative results related to the analysis.

Translation errors show consistent performance among all the classes except for bowl, where the 3D reconstruction step may face difficulties in detecting the depth of the hollow from RGB images only. Further pose randomization techniques can be applied to reduce such ambiguities. Conversely, bowl, as well as bottle and can, show results above the average regarding the rotation error, which does not penalize different predictions along the symmetry axes, as these do not affect grasping and manipulation tasks. In addition, symmetric objects benefit from fewer viewpoints required for a proper 3D shape reconstruction. Therefore, increasing that number during training for the other categories can provide an improvement. Concerning the camera category, results show lower performance on the rotation error, due to a shape variability that is higher than in other categories. Conversely, the class laptop behaves properly on n°, n cm, while it lags on 3D IoU, since the articulated structure of the object may lead to a low overlap between the reconstructed and ground truth 3D models, in spite of a good pose estimation.
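As anticipated, a minimal check of the n°, n cm criterion (without the symmetry-aware relaxation) can be sketched as follows; it only illustrates the standard rotation and translation error computation, not the exact evaluation code behind Table I.

# Sketch of the n°, n cm criterion (symmetry-aware relaxation not included).
import numpy as np

def pose_error(R_pred, t_pred, R_gt, t_gt):
    """Return (rotation error in degrees, translation error in centimeters)."""
    cos_angle = (np.trace(R_gt.T @ R_pred) - 1.0) / 2.0
    rot_err_deg = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))
    trans_err_cm = 100.0 * np.linalg.norm(t_pred - t_gt)   # translations in meters
    return rot_err_deg, trans_err_cm

def within_n_deg_n_cm(R_pred, t_pred, R_gt, t_gt, n_deg=10.0, n_cm=10.0):
    rot_err, trans_err = pose_error(R_pred, t_pred, R_gt, t_gt)
    return rot_err <= n_deg and trans_err <= n_cm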
(a) NOCS REAL-275 (b) YCB-V (c) Real-world images in the wild
Fig. 7: Qualitative results on some images from the NOCS REAL-275 (first column) and YCB-V (second column) test sets, and real-world examples in the wild (third column). None of the involved instances are included in the training dataset.
reconstruction (7.4 ms) and depth correction (12.4 ms) that, together, introduce a 50% overhead. However, even though this represents the price for moving from an instance-level to a more general category-level pipeline, real-time tasks can still be carried out without further optimization.

V. CONCLUSIONS

Two of the primary tasks for robots in many applications are object grasping and manipulation. Object detection and pose estimation are fundamental skills to give robots the ability to adapt to the changing environment and grasp a variety of objects that present different sizes, material properties, and texture appearances in cluttered scenes with different lighting conditions.

This work presents i2c-net to extract the 6D pose of multiple objects belonging to different categories, starting from an instance-level pose estimation network and exploiting a custom-made synthetic dataset for training. Such a dataset uses some base CAD models for known objects, encompassing the diverse shapes of the objects in that category, as a starting point for a deformation procedure, which provides new object models enriched with real textures for domain randomization purposes. The selected instance-level pose estimation network can be trained on such augmented photo-realistic images, and, at inference time, it is used in combination with a 3D mesh reconstruction network, achieving category-level capabilities. Depth information is used for post-processing as a correction. Tests conducted on real objects of the YCB-V and NOCS REAL-275 datasets show the high accuracy of the proposed method as well as its good real-time performance.

REFERENCES

[1] S. D'Avella, C. A. Avizzano, and P. Tripicchio, "Ros-industrial based robotic cell for industry 4.0: Eye-in-hand stereo camera and visual servoing for flexible, fast, and accurate picking and hooking in the production line," Robotics and Computer-Integrated Manufacturing, vol. 80, p. 102453, 2023.
[2] A. Ajoudani, A. M. Zanchettin, S. Ivaldi, A. Albu-Schäffer, K. Kosuge, and O. Khatib, "Progress and prospects of the human–robot collaboration," Autonomous Robots, vol. 42, no. 5, pp. 957–975, 2018.
[3] S. D'Avella, P. Tripicchio, and C. A. Avizzano, "A study on picking objects in cluttered environments: Exploiting depth features for a custom low-cost universal jamming gripper," Robotics and Computer-Integrated Manufacturing, vol. 63, p. 101888, 2020.
[4] G. Du, K. Wang, S. Lian, and K. Zhao, "Vision-based robotic grasping from object localization, object pose estimation to grasp estimation for parallel grippers: a review," Artificial Intelligence Review, vol. 54, no. 3, pp. 1677–1734, 2021.
[5] X. Wu, D. Sahoo, and S. C. Hoi, "Recent advances in deep learning for object detection," Neurocomputing, vol. 396, pp. 39–64, 2020.
[6] G. Jocher, A. Chaurasia, A. Stoken, J. Borovec et al., "ultralytics/yolov5: v6.1 - TensorRT, TensorFlow Edge TPU and OpenVINO Export and Inference," Feb. 2022.
[7] Y. Wu, A. Kirillov, F. Massa, W.-Y. Lo, and R. Girshick, "Detectron2," https://github.com/facebookresearch/detectron2, 2019.
[8] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, "Nerf: Representing scenes as neural radiance fields for view synthesis," in ECCV, 2020.
[9] M. Denninger, M. Sundermeyer, D. Winkelbauer, Y. Zidan, D. Olefir, M. Elbadrawy, A. Lodhi, and H. Katam, "Blenderproc," arXiv preprint arXiv:1911.01911, 2019.
[10] J. Tremblay, T. To, B. Sundaralingam, Y. Xiang, D. Fox, and S. Birchfield, "Deep object pose estimation for semantic robotic grasping of household objects," in Conference on Robot Learning (CoRL), 2018.
[11] Y. Xiang, T. Schmidt, V. Narayanan, and D. Fox, "Posecnn: A convolutional neural network for 6d object pose estimation in cluttered scenes," in Robotics: Science and Systems (RSS), 2018.
[12] H. Wang, S. Sridhar, J. Huang, J. Valentin, S. Song, and L. J. Guibas, "Normalized object coordinate space for category-level 6d object pose and size estimation," in 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2637–2646.
[13] B. Tekin, S. N. Sinha, and P. Fua, "Real-time seamless single shot 6d object pose prediction," in CVPR, 2018.
[14] G. Wang, F. Manhardt, F. Tombari, and X. Ji, "Gdr-net: Geometry-guided direct regression network for monocular 6d object pose estimation," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 16606–16616.
[15] Y. Di, F. Manhardt, G. Wang, X. Ji, N. Navab, and F. Tombari, "So-pose: Exploiting self-occlusion for direct 6d pose estimation," in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2021, pp. 12396–12405.
[16] M. Tian, M. H. Ang Jr, and G. H. Lee, "Shape prior deformation for categorical 6d object pose and size estimation," in Proceedings of the European Conference on Computer Vision (ECCV), August 2020.
[17] W. Chen, X. Jia, H. J. Chang, J. Duan, L. Shen, and A. Leonardis, "Fs-net: Fast shape-based network for category-level 6d object pose estimation with decoupled rotation mechanism," in 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 1581–1590.
[18] Y. Lin, J. Tremblay, S. Tyree, P. A. Vela, and S. Birchfield, "Single-stage keypoint-based category-level object pose estimation from an RGB image," in IEEE International Conference on Robotics and Automation (ICRA), 2022.
[19] N. Wang, Y. Zhang, Z. Li, Y. Fu, W. Liu, and Y.-G. Jiang, "Pixel2mesh: Generating 3d mesh models from single rgb images," in ECCV, 2018.
[20] A. Tewari, J. Thies, B. Mildenhall, P. Srinivasan, Tretschk et al., "Advances in neural rendering," Computer Graphics Forum, vol. 41, no. 2, pp. 703–735, 2022.
[21] J. J. Park, P. Florence, J. Straub, R. Newcombe, and S. Lovegrove, "Deepsdf: Learning continuous signed distance functions for shape representation," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[22] T. Müller, A. Evans, C. Schied, and A. Keller, "Instant neural graphics primitives with a multiresolution hash encoding," ACM Trans. Graph., vol. 41, no. 4, pp. 102:1–102:15, Jul. 2022.
[23] A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang et al., "Shapenet: An information-rich 3d model repository," ArXiv, vol. abs/1512.03012, 2015.
[24] A. Ahmadyan, L. Zhang, A. Ablavatski, J. Wei, and M. Grundmann, "Objectron: A large scale dataset of object-centric videos in the wild with pose annotations," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021.
[25] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[26] A. Simoni, S. Pini, R. Vezzani, and R. Cucchiara, "Multi-category mesh reconstruction from image collections," in 2021 International Conference on 3D Vision (3DV). IEEE, 2021, pp. 1321–1330.
[27] Z. Zhang, Weak Perspective Projection. Boston, MA: Springer US, 2014, pp. 877–883.
[28] P. Besl and H. McKay, "A method for registration of 3-D shapes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 14, pp. 239–256, 1992.
[29] Z. Fan, Y. Zhu, Y. He, Q. Sun, H. Liu, and J. He, "Deep learning on monocular object pose detection and tracking: A comprehensive overview," ACM Computing Surveys (CSUR), 2022.
[30] Luxonis, "Oak-d, depth-ai documentation," 2022, [Online; accessed 30-August-2022]. [Online]. Available: https://shop.luxonis.com/products/oak-d
[31] G. Gao, M. Lauri, Y. Wang, X. Hu, J. Zhang, and S. Frintrop, "6d object pose regression via supervised learning on point clouds," in 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020, pp. 3643–3649.
[32] Y. You, R. Shi, W. Wang, and C. Lu, "Cppf: Towards robust category-level 9d pose estimation in the wild," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.