When Transformer Meets Robotic Grasping: Exploits Context for Efficient Grasp Detection
Abstract—In this letter, we present a transformer-based architecture, namely TF-Grasp, for robotic grasp detection. The developed TF-Grasp framework has two elaborate designs that make it well suited for visual grasping tasks. The first key design is that we adopt local window attention to capture local contextual information and detailed features of graspable objects. Then, we apply cross-window attention to model the long-term dependencies between distant pixels. Object knowledge, environmental configuration, and relationships between different visual entities are aggregated for subsequent grasp detection. The second key design is that we build a hierarchical encoder-decoder architecture with skip-connections, delivering shallow features from the encoder to the decoder to enable multi-scale feature fusion. Due to the powerful attention mechanism, TF-Grasp can simultaneously obtain local information (i.e., the contours of objects) and model long-term connections such as the relationships between distinct visual concepts in clutter. Extensive computational experiments demonstrate that TF-Grasp achieves competitive results versus state-of-the-art grasping convolutional models and attains a higher accuracy of 97.99% and 94.6% on the Cornell and Jacquard grasping datasets, respectively. Real-world experiments using a 7-DoF Franka Emika Panda robot also demonstrate its capability of grasping unseen objects in a variety of scenarios. The code is available at https://github.com/WangShaoSUN/grasp-transformer.

Index Terms—Grasp detection, robotic grasping, vision transformer.

Manuscript received 23 February 2022; accepted 20 June 2022. Date of publication 29 June 2022; date of current version 7 July 2022. This letter was recommended for publication by Associate Editor S. Jain and Editor M. Vincze upon evaluation of the reviewers' comments. This work was supported by the National Natural Science Foundation of China under Grants U2013601 and 62173314. (Corresponding author: Zhen Kan.) The authors are with the Department of Automation, University of Science and Technology of China, Hefei 230026, China (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/LRA.2022.3187261

I. INTRODUCTION

DATA-DRIVEN methodologies such as deep learning have become the mainstream approach for robotic visual sensing tasks such as indoor localization [1], trajectory prediction [2], and robotic manipulation [3], [4], since they require less hand-crafted feature engineering and can be extended to many complex tasks. In recent years, as visual sensing is increasingly used in manufacturing, industry, and medical care, growing research is devoted to developing advanced robot perception abilities. A typical application of visual sensing is robotic grasp detection, where images of objects are used to infer the grasping pose. Considering a grasping task of manipulating a wide diversity of objects, to find the graspable regions the robots have to concentrate not only on partial geometric information but also on the entire visual appearance of the object. Particularly in unstructured and cluttered environments, dealing with variations in shape and position (e.g., occlusion) as well as the spatial relationships with other objects is critical to the performance of grasp detection. Therefore, this work is particularly motivated to investigate grasp detection that takes into account both local neighboring pixels and long-distance relationships in the spatial dimensions.

Most modern grasp detectors [3], [5] are based on convolutional neural networks (CNNs), which have emerged as the de facto standard for visual robotic grasping. However, current CNNs are composed of individual convolution kernels, which are more inclined to concentrate on local-level information. Also, the convolution kernels in a CNN layer are treated as independent counterparts without mutual information fusion. Generally, to maintain a large receptive field, CNNs have to repeatedly stack convolutional layers, which reduces the spatial resolution and inevitably results in the loss of global details and degraded performance.

Recently, as a novel approach to natural language processing and computer vision, the transformer [6]–[8] has demonstrated remarkable success. The widely adopted attention mechanisms [6] of transformers in sequence modeling provide an elegant way to fuse information across entire sequences. In fact, as robots are deployed in more and more diverse applications such as industrial assembly lines and smart homes, the sensing capacity of robotic systems needs to be enriched, not only over local regions but also through global interaction. Especially when robots frequently interact with objects in the environment, awareness built on global attention is particularly important with respect to safety and reliability. However, most vision transformers are designed for image classification on natural images. Few of them are specifically built for robotic tasks.

In this letter, we present a transformer-based visual grasp detection framework, namely TF-Grasp, which leverages the fact that attention can better aggregate information across the entire input sequence to obtain an improved global representation. More specifically, the information within independent image patches is bridged via self-attention, and the encoder in our framework captures these multi-scale low-level features. The decoder incorporates the high-level features through long-range spatial dependencies to construct the final grasping pose. We provide detailed empirical evidence to show that our grasping transformer performs reasonably well on popular
grasping testbeds, e.g., the Cornell and Jacquard grasping datasets. The experimental results demonstrate that the transformer architecture plays an integral role in generating appropriate grasping poses by learning local and global features from different parts of each object. The vision transformer-based grasp detector works well on a real robotic system and shows promising generalization to unseen objects. In addition, our TF-Grasp can generate the required grasping poses for parallel grippers in a single forward pass of the network.

In a nutshell, the contributions of this letter are summarised in three folds:
• This work presents a novel and neat transformer architecture for visual robotic grasping tasks. To the best of our knowledge, it is one of the first attempts to consider vision transformers in grasp detection tasks.
• We consider the simultaneous fusion of local and global features and redesign the classical ViT framework for robotic visual sensing tasks.
• Exhaustive experiments are conducted to show the advantages of the transformer-based robotic perception framework. The experimental results demonstrate that our model achieves improved performance on popular grasping datasets compared to state-of-the-art methods. We further show that our grasping transformer can generate appropriate grasping poses for known or unknown objects in either single-object or cluttered environments.

II. RELATED WORK

This section reviews recent advances in the field of robotic grasping and briefly describes the progress of transformers in different areas.

A. Grasp Detection

The ability to locate the object position and determine the appropriate grasping pose is crucial to stable and robust robotic grasping. Grasp detection, as the name implies, uses the image captured from the camera to infer the grasping pose for the robot manipulator. Using geometry-driven methods, earlier works [9], [10] mainly focus on analyzing the contours of objects to identify grasping points. A common assumption in these methods is that the geometric model of the object is always available. However, preparing CAD models for graspable objects is time-consuming and impractical for real-time implementation. Recently, deep learning based methods have been successfully applied in visual grasping tasks [3], [5], [11]–[13]. The work of [14] is one of the earliest to introduce deep neural networks to grasp detection via a two-stage strategy, where the first stage finds exhaustive possible grasping candidates and the second stage evaluates the quality of these candidates to identify the best one. However, due to the numerous grasping proposals, the method in [14] suffers from relatively slow speed. Many recent works utilize convolutional neural networks to generate bounding box proposals to estimate the grasp pose of objects. Redmon et al. [5] employed an AlexNet-like CNN architecture to regress grasping poses. Kumra et al. [3] explored the use of ResNet-50 as a backbone to incorporate multimodal information, including depth and RGB, to further improve grasp performance. Besides, CNN-based grasp quality networks [15], [16] were proposed to evaluate and predict the robustness of grasp candidates. In the same line, GG-CNN [17] developed a fully convolutional neural network to perform grasp detection, which provides a lightweight and real-time solution for visual grasping. Currently, most existing grasp detection methods are still heavily inspired by computer vision techniques such as object recognition and object detection. In contrast to classical visual problems, where the detected objects are usually well-defined instances in the scene, in grasp detection the grasp configuration to be generated is continuous, which implies an infinite number of possible grasp options. This places significant challenges on feature extraction to identify a valid grasp configuration from all possible candidates. We argue that the loss of long-term dependencies in feature extraction is a major drawback of current CNN-based grasp detection methods.

B. Transformer

The transformer [6] first emerged in machine translation and is rapidly establishing itself as a new paradigm in natural language processing due to its ability to model global information, learning high-quality features by considering the whole context. Thanks to its excellent global representation and parallel-friendly computation, the transformer is competitive in long-sequence modeling and is gradually replacing RNNs and CNNs.

Motivated by the remarkable success of transformers in natural language processing, more and more researchers are interested in employing attention mechanisms in visual tasks. At present, the transformer has been successfully applied to image classification, object detection, and segmentation tasks. However, there still exist many challenges. First, visual signals and word tokens differ on many scales. Second, the high dimensionality of pixel-level information may introduce significant computational complexity.

More recently, ViT [7] was presented as a transformer model for natural image recognition, splitting the image into non-overlapping patches. The authors in [8] proposed a hierarchical ViT, called the Swin Transformer, that computes local self-attention within shifted windows. In contrast to the quadratic computational complexity of self-attention in ViT, the Swin Transformer achieves linear complexity. Inspired by this line of work, many researchers have tried to apply the transformer to other fields. For example, TransUNet [18] combines the transformer and U-Net [19] for medical image diagnosis. Nevertheless, how to exploit the strengths of attention to aggregate information from the entire input has not been investigated for visual grasp detection. Unlike prior works, we design a transformer-based encoder-decoder architecture to predict the grasp posture in an end-to-end manner. It is shown that our method achieves higher grasp success than its state-of-the-art CNN counterparts.
III. METHOD

Grasp Representation: Autonomous visual grasping tasks generally start from collecting visual images of the object via sensory input, which are then processed to generate an effective grasp configuration that maximises the probability of grasp success. Considering a parallel-plate gripper, the grasp representation g [20] is formulated as a 5-dimensional tuple

g = {x, y, θ, w, h},   (1)

where (x, y) are the center coordinates of the grasp rectangle, (w, h) denote the width and height of the grasp rectangle, and θ is the orientation of the grasp rectangle with respect to the horizontal axis. Given a gripper with known dimensions, a simplified representation can be expressed as g = (p, φ, w), where p = (x, y), φ indicates the orientation angle of the gripper, and w denotes the opening distance of the gripper.

To facilitate grasping, we follow the setting in [17] and represent the grasp in 2-D image space as

G = {Q, W, Θ} ∈ R^{3×W×H},   (2)

where the grasp quality Q measures the grasp success of each pixel, and W and Θ are the gripper width and orientation angle maps. The value of each pixel in W and Θ represents the corresponding width and angle of the gripper at that position during the grasp.

Fig. 1. Overview of the TF-Grasp model. Our model takes as input the image captured by the camera mounted on the end-effector of the manipulator and generates a pixel-level grasp representation.

Consequently, in the developed TF-Grasp, the grasp detection task boils down to three sub-tasks, namely predicting the grasping position, angle, and width.
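As a concrete illustration of how the three pixel-wise maps are decoded into a single grasp g = (x, y, θ, w), the sketch below picks the highest-quality pixel and reads off the corresponding angle and width. The array shapes and the function name are our own illustrative assumptions, not code from the released implementation.

```python
import numpy as np

def decode_grasp(quality, angle, width):
    """Pick the best grasp from pixel-wise maps Q, Theta, W.

    quality, angle, width: 2-D arrays of shape (H, W), where each pixel
    holds the grasp quality, gripper orientation (rad), and opening width.
    Returns (x, y, theta, w) for the highest-quality pixel.
    """
    # Index of the pixel with the highest predicted grasp quality.
    y, x = np.unravel_index(np.argmax(quality), quality.shape)
    return x, y, float(angle[y, x]), float(width[y, x])

# Toy example: a 224x224 prediction with one confident pixel.
H = W = 224
Q = np.zeros((H, W)); Q[100, 50] = 0.9
Theta = np.full((H, W), np.pi / 4)
Wmap = np.full((H, W), 40.0)
print(decode_grasp(Q, Theta, Wmap))  # -> (50, 100, 0.785..., 40.0)
```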
Grasp Transformer Overview: A deep motivation of this work is that the treatment of robot perception in complex, dynamic robotic tasks should be global and holistic, with mutual information fusion. Specifically, the grasping model is formulated as an encoder-decoder architecture with a U-shaped structure, as detailed in Fig. 1. The encoder branch aggregates the entire visual input, mutually fuses features using attention blocks, and then extracts the specific features that are useful for visual robotic grasping. During decoding, the model incorporates features delivered via skip-connections and performs a pixel-level grasp prediction by up-sampling. More concretely, the attention modules in the decoder enable more comprehensive processing of local and long-range information, allowing for better multi-scale feature fusion. Each pixel in the prediction heatmap is correlated with the final location and orientation of the end-effector.
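To make the U-shaped wiring concrete, the following PyTorch sketch shows how shallow encoder features travel over a skip-connection, are concatenated during decoding, and feed three pixel-wise output heads. It is a minimal stand-in under our own assumptions (layer counts, channel widths, and class names are invented, and plain convolutions replace the attention blocks for brevity), not the released TF-Grasp architecture.

```python
import torch
import torch.nn as nn

class UShapedGraspNet(nn.Module):
    """Toy U-shaped encoder-decoder with a skip connection.

    Convolutions stand in for the attention stages purely to keep the
    sketch short; the skip-connection wiring is the point being shown.
    """
    def __init__(self, in_ch=3, base=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv2d(in_ch, base, 3, padding=1), nn.ReLU())
        self.enc2 = nn.Sequential(nn.Conv2d(base, base * 2, 3, stride=2, padding=1), nn.ReLU())
        self.bottleneck = nn.Sequential(nn.Conv2d(base * 2, base * 2, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        # Decoder sees upsampled features concatenated with the skip features.
        self.dec1 = nn.Sequential(nn.Conv2d(base * 2, base, 3, padding=1), nn.ReLU())
        # Three pixel-wise heads: grasp quality, angle, and width maps.
        self.heads = nn.ModuleDict({k: nn.Conv2d(base, 1, 1) for k in ("quality", "angle", "width")})

    def forward(self, x):
        s1 = self.enc1(x)                          # shallow features kept for the skip
        s2 = self.enc2(s1)                         # downsampled deeper features
        b = self.bottleneck(s2)
        d = self.up(b)                             # upsample back to input resolution
        d = self.dec1(torch.cat([d, s1], dim=1))   # skip-connection fusion
        return {k: head(d) for k, head in self.heads.items()}

maps = UShapedGraspNet()(torch.randn(1, 3, 224, 224))
print({k: tuple(v.shape) for k, v in maps.items()})
```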
To bridge the domain gap between the transformer and visual robotic grasping tasks, we have carefully designed our grasping transformer in the following aspects for improved grasp detection. (a) Cascade Design. Different from the classic ViT architecture, we adopt a cascaded encoder-decoder structure. The encoder utilizes self-attention to learn a contextual representation that facilitates grasping, and the decoder makes use of the extracted features to perform a pixel-level grasp prediction. (b) Local and Global Balance. We utilize the Swin attention layer to achieve a trade-off between global and local information for better scene perception. Window attention performs local feature extraction, and the shifted-window attention allows cross-window interactions to globally focus on more diverse regions. (c) Feature Fusion. The feature representations at different stages are connected by skip-connections for multi-scale feature fusion, which acquires both rich semantic and detailed features. (d) Lightweight Design. It is essential for robots to account for efficiency and real-time performance. We utilize shifted attention blocks and a slimming design for our grasping transformer to reach an ideal trade-off between performance and speed.

Grasp Transformer Encoder: Before being fed into the encoder, the image is first passed through a patch partition layer and cut into non-overlapping patches. Each patch is treated like a word token in text. For example, a 2-D image I ∈ R^{W×H×C} is split into fixed-size patches x ∈ R^{N×(P×P×C)}, where (H, W) denote the height and width of the original image, C represents the number of image channels, P is the side length of each image patch, and N = HW/P^2 is the number of image patches. Token-based representations are then obtained by passing the image patches through a projection layer.
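For illustration, the patch partition plus projection step can be written as a single strided convolution, which is equivalent to cutting non-overlapping P × P patches and applying one shared linear projection. The patch size and embedding dimension below are assumed values, not necessarily the paper's settings.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and project each to an embedding."""
    def __init__(self, patch=4, in_ch=3, dim=96):
        super().__init__()
        # A convolution with kernel = stride = P is equivalent to cutting
        # non-overlapping patches and applying one shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, dim, H/P, W/P)
        return x.flatten(2).transpose(1, 2)  # (B, N, dim) with N = HW / P^2

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])
```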
The encoder is composed of stacked identical transformer blocks. The attention in a transformer block builds long-distance interactions across distant pixels and attends to these positions in the embedding space. At the top of the encoder is a bottleneck block attached to the decoder. The fundamental element in our grasping transformer framework is multi-head self-attention. The input feature X is linearly transformed to derive the query Q, key K, and value V, which are defined as follows:

Q = XW_Q,  K = XW_K,  V = XW_V,   (3)
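A minimal single-head sketch of this windowed self-attention is given below. It assumes the standard scaled dot-product form softmax(QK^T/√d)V applied independently inside each local window, with the window partition done beforehand; dimensions and names are illustrative, not the exact TF-Grasp configuration.

```python
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Single-head self-attention computed independently per local window.

    The input is assumed to be already partitioned into windows, i.e. a
    tensor of shape (num_windows, window*window, dim).
    """
    def __init__(self, dim=96, window=7):
        super().__init__()
        self.window, self.scale = window, dim ** -0.5
        self.to_qkv = nn.Linear(dim, 3 * dim)   # W_Q, W_K, W_V fused into one layer
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)          # Q = XW_Q, K = XW_K, V = XW_V
        attn = (q @ k.transpose(-2, -1)) * self.scale      # QK^T / sqrt(d)
        attn = attn.softmax(dim=-1)
        return self.proj(attn @ v)                         # attention-weighted sum of values

# 64 local windows of 7x7 = 49 tokens each, embedding dimension 96.
x = torch.randn(64, 49, 96)
print(WindowSelfAttention()(x).shape)  # torch.Size([64, 49, 96])
```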
Fig. 3. The visualized attention heatmaps learned by our method, which show that our transformer model can learn the concepts beneficial for grasping.
TABLE II
THE ACCURACY ON JACQUARD GRASPING DATASET

For the results in Table II, we use 90% of the Jacquard dataset as the training set and the remaining 10% as the validation set. In addition, our model takes about 41 ms to process a single image on an Intel Core i9-10900X CPU, which is competitive with state-of-the-art approaches and basically meets real-time requirements. The transformer grasping model exhibits better accuracy on both datasets compared to conventional CNN models. Our proposed approach achieves a higher accuracy of 94.6%, which is on par with or superior to previous methods. The results on the Cornell and Jacquard datasets all indicate that the model with the attention mechanism is more suitable for visual grasping tasks.
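The latency figure can be checked in spirit with a simple wall-clock benchmark such as the sketch below; the stand-in model, input resolution, and repetition count are placeholders, not the exact protocol behind the 41 ms measurement.

```python
import time
import torch

def mean_latency_ms(model, input_shape=(1, 3, 224, 224), repeats=50):
    """Average single-image forward-pass time on CPU, in milliseconds."""
    model.eval()
    x = torch.randn(*input_shape)
    with torch.no_grad():
        for _ in range(5):          # warm-up runs are excluded from timing
            model(x)
        start = time.perf_counter()
        for _ in range(repeats):
            model(x)
    return (time.perf_counter() - start) / repeats * 1e3

# Example with a stand-in network; substitute the grasp detector here.
print(f"{mean_latency_ms(torch.nn.Conv2d(3, 16, 3, padding=1)):.1f} ms per image")
```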
Despite the fact that our model is trained on a single-object dataset, it can be well adapted to multi-object environments with the help of the attention mechanism. In addition, to evaluate the advantages of the transformer versus CNNs for visual grasping tasks, we use the original convolution layers, residual layers, and our transformer as feature extractors and test detection accuracy on different objects from the Cornell dataset. We apply an object-wise split to the Cornell dataset, and Fig. 5 shows the detection accuracy on objects not seen during the training phase. All objects are subsets of the Cornell dataset and are evaluated 5 times. All models shown in Fig. 5 employ an encoder-decoder architecture with 4 stages in order to guarantee a fair comparison, where original-conv is a fully convolutional neural network and resnet-conv replaces the original convolution layers with residual blocks. The results of the different models are shown in Fig. 5. Note that the transformer outperforms the original convolutions on all selected objects and is marginally better than or on par with the residual network.

These results demonstrate that the transformer improves robotic grasp detection. We conjecture that prior methods that rely on the local operations of convolution layers might ignore the dependencies between long-range pixels. Instead, our approach leverages the attention mechanism to exploit both local and global information and integrates features that are useful for grasping. To better demonstrate whether the transformer-based grasping model can capture the relationships between objects and across the scene, we present the multi-object grasping results and grasping quality heatmaps of the transformer and the CNN in Fig. 4. Our aim is to verify that the transformer is preferred over the CNN for visual grasping tasks and is better at capturing global and local information. From Fig. 4, we can see that the grasp rectangles predicted by the CNN have the right grasp position in most cases, but the predicted gripper angle and width are often not appropriate. In some cases, the CNN even generates grasping rectangles in the background. With the attention mechanism, our transformer-based model is able to clearly distinguish the objects from the background. In the second row of Fig. 4, the grasp quality images show that the CNN-based approach cannot identify the graspable area and considers the entire region of the objects as a graspable zone with high success probabilities. Instead, as shown in the fourth row of Fig. 4, the transformer-based model is prone to capture the area that is easy to grasp due to its larger receptive field. For each attention block, the attention operation establishes inter-element relationships through self-attention, and the subsequent multi-layer perceptron (MLP) module further models the inherent relations between elements. The layer normalization and residual connections that interleave these two operations keep the training stable and efficient. In contrast, in a CNN, the receptive field of each convolutional kernel is limited. To build a larger receptive field, the model often needs to repeatedly stack convolutional layers to gain global and semantically rich features. However, such a method in general results in the loss of detailed feature information, such as the position and shape of objects, that is essential for grasping tasks. Therefore, we exploit a transformer-based model which can better capture not only the global information but also detailed features (e.g., the position and shape information).
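The block structure described here, self-attention and an MLP interleaved with layer normalization and residual connections, can be sketched as a standard pre-norm transformer block; the dimensions and the use of PyTorch's generic multi-head attention are our own simplifications rather than the paper's exact block.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """Pre-norm transformer block: x + Attn(LN(x)), then x + MLP(LN(x))."""
    def __init__(self, dim=96, heads=3, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim), nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x):                       # x: (B, N, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]           # self-attention + residual connection
        x = x + self.mlp(self.norm2(x))         # MLP + residual connection
        return x

print(TransformerBlock()(torch.randn(2, 49, 96)).shape)  # torch.Size([2, 49, 96])
```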
C. Visualization Analysis

To clarify why the transformer architecture is helpful for grasp detection tasks, we visualize the attention heatmaps, as detailed in Fig. 3. From these heatmaps, we can discover that the self-attention modules readily learn the areas that are easy to grasp, such as the edges of objects, ignore irrelevant details, and pay more attention to the contours and shapes of the objects. Meanwhile, the model focuses on more general characteristics rather than individual features. For example, for the chairs shown in Fig. 3, our method assigns a higher grasp quality to the edges of the chairs. We further provide more concrete examples of real-world grasping, and the experimental results show that the attention mechanism is more likely to achieve a better understanding of the grasping scenario, generate more accurate grasping rectangles, and work well on both household and novel objects. In Fig. 6, we illustrate a pick-and-place task based on our TF-Grasp on the Franka manipulator. Our grasp detection system works well for novel objects that were not seen during the training procedure and also locates graspable objects in cluttered environments.
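One generic way to produce such heatmaps is to average the attention weights received by each patch token and project the result back onto the image grid; the sketch below follows this recipe and is not the exact visualization pipeline used for Fig. 3 (the grid size, upsampling scheme, and function name are assumptions).

```python
import numpy as np
import matplotlib.pyplot as plt

def attention_heatmap(attn, grid_hw, image_hw):
    """Turn token-to-token attention weights into an image-sized heatmap.

    attn: (N, N) attention matrix over N = grid_h * grid_w patch tokens.
    """
    saliency = attn.mean(axis=0).reshape(grid_hw)      # attention received per token
    # Nearest-neighbour upsampling from the patch grid to image resolution.
    ys = np.linspace(0, grid_hw[0] - 1, image_hw[0]).astype(int)
    xs = np.linspace(0, grid_hw[1] - 1, image_hw[1]).astype(int)
    return saliency[np.ix_(ys, xs)]

attn = np.random.rand(14 * 14, 14 * 14)               # e.g. weights taken from one attention layer
heat = attention_heatmap(attn, (14, 14), (224, 224))
plt.imshow(heat, cmap="jet", alpha=0.6)               # in practice, overlay on the input image
plt.savefig("attention_heatmap.png")
```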
In conclusion, the visualization results indicate that our TF-Grasp can produce more general and robust predictions, which contributes to improving the detection accuracy.

D. Ablation Studies

To understand the role of skip-connections in our transformer model on visual grasping problems, we conduct experiments
TABLE IV
THE RESULTS FOR PHYSICAL SETUP

V. DISCUSSION AND CONCLUSION

In this work, we develop a novel architecture for visual grasping. Although CNNs and their variants are still the dominant models in visual robotic grasping, we show the powerful potential of transformers in grasp detection. Compared with their CNN-based counterparts, transformer-based grasp detection models are better at capturing global dependencies and learning powerful feature representations. The results show that our proposed approach outperforms the original CNN-based models, and the context can be better represented by attention propagation. Nevertheless, the current approach is limited to the parallel gripper. Future research will focus on developing a universal transformer-based grasp detection method for other types of grippers, such as the five-finger dexterous hand.

REFERENCES
[1] J. Song, M. Patel, and M. Ghaffari, "Fusing convolutional neural network and geometric constraint for image-based indoor localization," IEEE Robot. Automat. Lett., vol. 7, no. 2, pp. 1674–1681, Apr. 2022.
[2] D. Zhao and J. Oh, "Noticing motion patterns: A temporal CNN with a novel convolution operator for human trajectory prediction," IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 628–634, Apr. 2021.
[3] S. Kumra and C. Kanan, "Robotic grasp detection using deep convolutional neural networks," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2017, pp. 769–776.
[4] X. Zhu, Y. Zhou, Y. Fan, and M. Tomizuka, "Learn to grasp with less supervision: A data-efficient maximum likelihood grasp sampling loss," 2021, arXiv:2110.01379.
[5] J. Redmon and A. Angelova, "Real-time grasp detection using convolutional neural networks," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2015, pp. 1316–1322.
[6] A. Vaswani et al., "Attention is all you need," in Proc. Annu. Conf. Neural Inf. Process. Syst., Long Beach, CA, USA, 2017, pp. 5998–6008.
[7] A. Dosovitskiy et al., "An image is worth 16x16 words: Transformers for image recognition at scale," in Proc. Int. Conf. Learn. Representations, 2021.
[8] Z. Liu et al., "Swin transformer: Hierarchical vision transformer using shifted windows," in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2021, pp. 10012–10022.
[9] R. M. Murray, Z. Li, and S. S. Sastry, A Mathematical Introduction to Robotic Manipulation. Boca Raton, FL, USA: CRC Press, 2017.
[10] A. Bicchi and V. Kumar, "Robotic grasping and contact: A review," in Proc. IEEE Int. Conf. Robot. Automat., San Francisco, CA, USA, 2000, pp. 348–353.
[11] H. Zhang, X. Lan, S. Bai, X. Zhou, Z. Tian, and N. Zheng, "ROI-based robotic grasp detection for object overlapping scenes," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2019, pp. 4768–4775.
[12] U. Asif, J. Tang, and S. Harrer, "GraspNet: An efficient convolutional neural network for real-time grasp detection for low-powered devices," in Proc. 27th Int. Joint Conf. Artif. Intell., 2018, pp. 4875–4882.
[13] X. Zhu, L. Sun, Y. Fan, and M. Tomizuka, "6-DoF contrastive grasp proposal network," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2021, pp. 6371–6377.
[14] I. Lenz, H. Lee, and A. Saxena, "Deep learning for detecting robotic grasps," Int. J. Robot. Res., vol. 34, no. 4-5, pp. 705–724, 2015.
[15] J. Mahler et al., "Dex-net 2.0: Deep learning to plan robust grasps with synthetic point clouds and analytic grasp metrics," in Proc. Robot.: Sci. Syst. XIII, Cambridge, MA, USA, 2017.
[16] A. Gariépy, J.-C. Ruel, B. Chaib-Draa, and P. Giguere, "GQ-STN: Optimizing one-shot grasp detection based on robustness classifier," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2019, pp. 3996–4003.
[17] D. Morrison, P. Corke, and J. Leitner, "Learning robust, real-time, reactive robotic grasping," Int. J. Robot. Res., vol. 39, no. 2-3, pp. 183–201, 2020.
[18] J. Chen et al., "TransUNet: Transformers make strong encoders for medical image segmentation," 2021, arXiv:2102.04306.
[19] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241.
[20] Y. Jiang, S. Moseson, and A. Saxena, "Efficient grasping from RGBD images: Learning using a new rectangle representation," in Proc. IEEE Int. Conf. Robot. Automat., 2011, pp. 3304–3311.
[21] A. Depierre, E. Dellandréa, and L. Chen, "Jacquard: A large scale dataset for robotic grasp detection," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2018, pp. 3511–3516.
[22] Z. Wang, Z. Li, B. Wang, and H. Liu, "Robot grasp detection using multimodal deep convolutional neural networks," Adv. Mech. Eng., vol. 8, no. 9, 2016, Art. no. 1687814016668077.
[23] U. Asif, M. Bennamoun, and F. A. Sohel, "RGB-D object recognition and grasp detection using hierarchical cascaded forests," IEEE Trans. Robot., vol. 33, no. 3, pp. 547–564, Jun. 2017.
[24] H. Karaoguz and P. Jensfelt, "Object detection approach for robot grasp detection," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2019, pp. 4953–4959.
[25] D. Guo, F. Sun, H. Liu, T. Kong, B. Fang, and N. Xi, "A hybrid deep architecture for robotic grasp detection," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2017, pp. 1609–1614.
[26] S. Ainetter and F. Fraundorfer, "End-to-end trainable deep neural network for robotic grasp detection and semantic segmentation from RGB," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2021, pp. 13452–13458.
[27] S. Kumra, S. Joshi, and F. Sahin, "Antipodal robotic grasping using generative residual convolutional neural network," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2020, pp. 9626–9633.
[28] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in Proc. Int. Conf. Learn. Representations, 2018.
[29] X. Zhou, X. Lan, H. Zhang, Z. Tian, Y. Zhang, and N. Zheng, "Fully convolutional grasp detection network with oriented anchor box," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), 2018, pp. 7223–7230.
[30] G. Bradski, "The OpenCV library," Dr. Dobb's J.: Softw. Tools Professional Programmer, vol. 25, no. 11, pp. 120–123, 2000.
[31] L. Pinto and A. Gupta, "Supersizing self-supervision: Learning to grasp from 50k tries and 700 robot hours," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2016, pp. 3406–3413.
[32] F.-J. Chu, R. Xu, and P. A. Vela, "Real-world multiobject, multigrasp detection," IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 3355–3362, Oct. 2018.