ORIGINAL RESEARCH
published: 25 January 2022
doi: 10.3389/frai.2021.744476

Improving Robotic Hand Prosthesis Control With Eye Tracking and Computer Vision: A Multimodal Approach Based on the Visuomotor Behavior of Grasping

Matteo Cognolato 1,2, Manfredo Atzori 1,3*, Roger Gassert 2 and Henning Müller 1,4*

1 Institute of Information Systems, University of Applied Sciences and Arts of Western Switzerland (HES-SO Valais-Wallis), Sierre, Switzerland, 2 Rehabilitation Engineering Laboratory, Department of Health Sciences and Technology, ETH Zurich, Zurich, Switzerland, 3 Department of Neuroscience, University of Padua, Padua, Italy, 4 Faculty of Medicine, University of Geneva, Geneva, Switzerland

Edited by: Ragnhild Eg, Kristiania University College, Norway
Reviewed by: Strahinja Dosen, Aalborg University, Denmark; Christian Nissler, German Aerospace Center (DLR), Germany
*Correspondence: Manfredo Atzori, [email protected]; Henning Müller, [email protected]
Specialty section: This article was submitted to Machine Learning and Artificial Intelligence, a section of the journal Frontiers in Artificial Intelligence
Received: 20 July 2021; Accepted: 06 December 2021; Published: 25 January 2022
Citation: Cognolato M, Atzori M, Gassert R and Müller H (2022) Improving Robotic Hand Prosthesis Control With Eye Tracking and Computer Vision: A Multimodal Approach Based on the Visuomotor Behavior of Grasping. Front. Artif. Intell. 4:744476. doi: 10.3389/frai.2021.744476

The complexity and dexterity of the human hand make the development of natural and robust control of hand prostheses challenging. Although a large number of control approaches were developed and investigated in the last decades, limited robustness in real-life conditions often prevented their application in clinical settings and in commercial products. In this paper, we investigate a multimodal approach that exploits the use of eye-hand coordination to improve the control of myoelectric hand prostheses. The analyzed data are from the publicly available MeganePro Dataset 1, which includes multimodal data from transradial amputees and able-bodied subjects while grasping numerous household objects with ten grasp types. A continuous grasp-type classification based on surface electromyography served as both intent detector and classifier. At the same time, the information provided by eye-hand coordination parameters, gaze data and object recognition in first-person videos allowed the identification of the object a person aims to grasp. The results show that the inclusion of visual information significantly increases the average offline classification accuracy, by up to 15.61 ± 4.22% for the transradial amputees and by up to 7.37 ± 3.52% for the able-bodied subjects, allowing trans-radial amputees to reach an average classification accuracy comparable to intact subjects and suggesting that the robustness of hand prosthesis control based on grasp-type recognition can be significantly improved with the inclusion of visual information extracted by leveraging natural eye-hand coordination behavior and without placing additional cognitive burden on the user.

Keywords: hand prosthetics, electromyography, deep learning, multi-modal machine learning, eye-tracking, eye-hand coordination, assistive robotics, manipulators

1. INTRODUCTION

The loss of a hand deprives an individual of an essential part of the body, and a prosthesis that can be controlled intuitively and reliably is therefore essential to effectively restore the missing functionality. Dexterous hand prostheses with notable mechanical capabilities are now commercially available. They commonly have independent digit actuation,
active thumb opposition, sufficient grip force and sometimes a motorized wrist. These characteristics make such devices capable of performing a large variety of grasps that can substantially simplify the execution of activities of daily living (ADL) for hand amputees. On the other hand, to fully exploit these capabilities, the control system must be able to precisely and reliably decode the grasp the user intends to perform. Although numerous non-invasive strategies have been developed to achieve a robust and natural control for multifunction hand prostheses, their employment in commercial products and clinical practice is still limited (Castellini et al., 2014; Farina et al., 2014; Vujaklija et al., 2016). Pattern recognition-based approaches are arguably the most investigated ones in scientific research. They identify the grasp type by applying pattern recognition methods to the electrical activity of the remnant musculature recorded via surface electromyography (sEMG) (Hudgins et al., 1993; Scheme and Englehart, 2011; Jiang et al., 2012). Despite the remarkable performance obtained by these methods in controlled environments, they commonly have difficulty providing the level of robustness required for daily life activities in real-life conditions (Jiang et al., 2012; Castellini et al., 2014; Farina and Amsüss, 2016; Campbell et al., 2020). The intrinsic variability of the electromyographic signals and the presence of factors affecting them are arguably the main causes for the lack of robustness of pattern recognition-based myoelectric control methods (Farina et al., 2014; Campbell et al., 2020). Several strategies have been proposed to overcome these limitations, such as the development or selection of more robust signal features (e.g., Phinyomark et al., 2012; Khushaba et al., 2014, 2017; Al-Timemy et al., 2016), the application of more advanced methods of analysis, such as deep learning (e.g., Atzori et al., 2016; Geng et al., 2016; Faust et al., 2018), the use or addition of different modalities (e.g., Castellini et al., 2012; Gijsberts and Caputo, 2013; Jaquier et al., 2017), and the inclusion of complementary sources of information to increase the autonomy of the control system (e.g., Došen et al., 2010; Markovic et al., 2015; Amsuess et al., 2016). Despite all the mentioned difficulties, in the last years research achievements in pattern recognition helped to develop commercial pattern classification approaches, such as COAPT engineering (http://www.coaptengineering.com/) and Myo Plus from Ottobock (https://www.ottobockus.com/prosthetics/upper-limb-prosthetics/solution-overview/myo-plus/myo-plus.html).

The idea of providing a prosthesis with decision-making capabilities is not novel (Tomovic and Boni, 1962), and several approaches have been proposed and investigated over the years. A promising method relies on identifying the most appropriate grasp type by obtaining information on the object a person aims to grasp (Došen and Popović, 2010; Markovic et al., 2014; Ghazaei et al., 2017; Taverne et al., 2019). In order to do so, the control system must have the ability to identify and extract information on the target object reliably, and the reliance on visual information is a common and natural strategy for this purpose. The use of visual modalities is motivated by the natural eye-hand coordination behavior humans use during grasping, where the information needed to plan the motor action is retrieved by briefly fixating the object to be grasped before the hand movement (Land et al., 1999; Johansson et al., 2001; Land, 2006). Several studies have investigated the use of eye tracking techniques to improve the human-robot interaction during grasping. Castellini and colleagues have investigated the use of gaze to increase the level of autonomy of a robotic device in the context of teleoperation, imagining the possible benefit this approach could have in the control of prosthetic hands (Castellini and Sandini, 2007). In Corbett et al. (2012), an eye tracker was employed to improve the trajectory estimation of an assistive robotic hand for spinal cord injury patients. The authors concluded that the inclusion of gaze data not only improved the trajectory estimation but also reduced the burden placed on the user, facilitating the control. The electro-oculography technique was used in Hao et al. (2013) to extract object characteristics to pre-shape a hand prosthesis. In this work the participants were asked to scan the object's contour with their eyes, allowing the system to select the most suitable grasp type by predicting its affordances. Eye-hand coordination parameters were used in Cognolato et al. (2017) to semi-automatically extract patches of the object a person aims to grasp for the training of an object recognition system. Gigli et al. (2018) investigated the inclusion of information on the target object for grasp-type classification tasks. This evaluation exploited the use of sEMG as a reaching phase detector, which triggers a fixation search to identify and segment the aimed object. Once the object is identified, a convolutional neural network (CNN) extracts visual features that are fused at kernel level with the sEMG modality. The results show a consistent improvement in classification accuracy for 5 able-bodied subjects with respect to the unimodal sEMG-based classification, showing that the inclusion of visual information produces an increment in grasp-type recognition robustness. The work by Gigli et al. (2018) sets a fundamental baseline in the domain of prosthetics. However, despite being on a similar topic and dataset, the previous work is strongly different from this paper. First, the authors did not include hand amputees in the dataset and they included a small number of intact subjects. Second, the approach was not fully based on deep neural networks. Third, it included the identification of fixations, which brings the disadvantage of shortening the time frame to identify the target object, which could prevent a correct identification (Gregori et al., 2019). Evaluations on transradial amputees are particularly desirable to better investigate the performance of multimodal approaches based on gaze and sEMG for prosthetic applications.

This work aims at investigating the benefit of including visual information obtained by unobtrusively exploiting eye-hand coordination parameters in transradial amputees and able-bodied subjects to achieve an improved and more robust grasp-type classification for hand prosthesis control. To do so, we use the recently released MeganePro Dataset 1 (Cognolato et al., 2020), which includes sEMG, accelerometry, gaze, and first-person videos recorded from 15 transradial amputees and 30 able-bodied subjects while grasping several objects with ten grasp types. We used a Convolutional Long Short-Term Memory (ConvLSTM) network to perform sEMG-based grasp-type classification and a Mask Region-based Convolutional Neural Network (Mask R-CNN) (He et al., 2017)
to identify and segment the objects in front of the subject. Eye-hand coordination parameters were used to identify the object the subject aims to grasp, and the information from both modalities is combined for a final prediction.

2. MATERIALS AND METHODS

2.1. Data

The data used in this work are publicly available in the MeganePro Dataset 1 (MDS1) (Cognolato et al., 2020). The sEMG data were collected at 1926 Hz using a Delsys Trigno Wireless EMG System (Delsys Inc., US). Gaze and first-person video were recorded at 100 Hz and 25 frames per second (FPS), respectively, with a Tobii Pro Glasses 2 eye tracker (Tobii AB, SE). The dataset contains recordings from 15 transradial amputees and 30 able-bodied subjects while grasping numerous household objects with ten grasp types. Twelve sEMG electrodes were placed around the forearm or residual limb in a two-array configuration. The acquisition protocol consisted of two parts. In the static condition, subjects were asked to statically perform 10 grasps, from a seated and a standing position. Each grasp was matched with three household objects chosen among a total of 18 objects, as shown in Table 1 (Cognolato et al., 2020). Grasp-object pairings were chosen so that each grasp was used with multiple objects and vice versa. At least five objects were placed in front of the subject, simulating a real environment. In the dynamic condition, the participants were asked to perform an action with the object after having grabbed it (i.e., opening a door handle or drinking from a can). The dynamic condition was repeated eight times on a set of two objects, either standing or seated. Explanatory videos, shown at the beginning of each grasp-block, instructed the participants on how to perform the grasps, and vocal instructions guided them through the exercises. The multiple relationships between objects and grasp types, as well as the simultaneous presence of several objects in the scene in front of the subjects, made the acquisition protocol similar to an everyday life scenario, providing a set of data suitable to investigate multimodal control strategies based on sEMG, gaze and visual information.

TABLE 1 | Overview of the grasp types and objects for the condition of the exercise.

Grasp                          Objects
1  Medium wrap                 Bottle, Can, Door handle
2  Lateral                     Mug, Key, Pencil case
3  Parallel extension          Plate, Book, Drawer
4  Tripod grasp                Bottle, Mug, Drawer
5  Power sphere                Ball, Bulb, Key
6  Precision disk              Jar, Bulb, Ball
7  Prismatic pinch             Clothespin, Key, Can
8  Index finger extension      Remote, Knife, Fork
9  Adducted thumb              Screwdriver, Remote, Wrench
10 Prismatic four finger       Knife, Fork, Wrench

2.2. Unimodal sEMG-Based Grasp-Type Classification

We employed a ConvLSTM to perform the grasp-type classification based solely on electromyographic signals (Figure 2). One of the advantages of this network architecture is the capability of exploiting both spatial and temporal relationships of the data (Shi et al., 2015). This characteristic can positively impact the performance in this type of application, where a grasp type can be discriminated from the muscular activity of the extrinsic muscles of the hand by taking into account which muscle activates (i.e., via its location) and the temporal pattern of such contractions. The ConvLSTM network (Shi et al., 2015) is based on the widely used Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber, 1997) model. However, it differs from the conventional LSTM structure in the convolutional operations performed in both input and internal state transformations, which provide the capability of handling spatiotemporal information (Shi et al., 2015). The network used in this work consists of a Convolutional Long Short-Term Memory layer with 128 filters and a kernel size of 1 by 3, followed by a dropout layer with a rate of 0.5, a flatten layer, and two fully connected layers of 200 and 50 units, respectively, with Rectified Linear Unit (ReLU) as activation function and a dropout rate of 0.2. A fully connected layer with 11 units (the number of classes) and softmax as activation function provides the output of the network. We used the categorical cross-entropy loss function and Adam (Kingma and Ba, 2014) as optimizer with the default parameters. The network was implemented using the Keras functional API.
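As a concrete illustration, a model following this description could be assembled with the Keras functional API roughly as sketched below. This is a minimal sketch based on the layer sizes given above, not the authors' released code; in particular, the exact placement of the 0.2-rate dropout between the two dense layers and the default Adam settings are assumptions.

```python
# Minimal sketch of the ConvLSTM classifier described above (assumed implementation).
# Input: windows of 200 sEMG samples x 12 electrodes, split into 10 subsequences
# of shape 1 x 20 x 12, i.e., one (10, 1, 20, 12) tensor per window.
from tensorflow import keras
from tensorflow.keras import layers

def build_convlstm(n_channels=12, n_steps=10, n_classes=11):
    inputs = keras.Input(shape=(n_steps, 1, 20, n_channels))
    x = layers.ConvLSTM2D(filters=128, kernel_size=(1, 3))(inputs)
    x = layers.Dropout(0.5)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(200, activation="relu")(x)
    x = layers.Dropout(0.2)(x)        # assumed position of the 0.2-rate dropout
    x = layers.Dense(50, activation="relu")(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

The input shape corresponds to the N × 10 × 1 × 20 × 12 windowing described in the following paragraphs.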

FIGURE 1 | Four-fold data structure segmentation schema. As described in section 2.1, each grasp-type was performed on 2 conditions (static and dynamic) and repeated 32 times on different objects. Three repetitions from the static seated condition and from the static standing, and 2 from the dynamic condition formed a fold, which contains 8 repetitions per grasp-type.

The sEMG data were shaped following a rest-grasp-rest pattern, with a random amount of rest of a minimum of 1 s, and including part of the rest after the previous explanatory video for the first repetition. Therefore, a total of 320 sEMG segments were extracted per subject. We subsequently divided these segments into 4 folds, following the scheme employed in Cognolato et al. (2020) and graphically described in Figure 1. Each fold contains 8 repetitions per grasp-type, namely 3 repetitions from the static seated condition, 3 from the static standing, and 2 from the dynamic, performed on different objects. This allowed us to test the performance with a 4-fold cross-validation procedure, where 3 folds were used to train the network and the held-out fold was used for testing. For each training, a validation set of 30 repetitions was randomly drawn from the training set to evaluate the best model based on the validation accuracy. The validation set was obtained by randomly extracting a repetition from each condition per grasp type, resulting in a training set of 210 repetitions and a validation set of 30. We pre-processed all the data by removing the mean, making the variance unitary with a scaler that was fit on the training set, and performing data rectification. After this, the data were windowed with a window size of 200 samples (slightly more than 100 ms) with no overlap, obtaining an N × 200 × 12 tensor, with N the number of windows and 12 the number of electrodes. Finally, each window was divided into 10 subsequences, each structured as a single row by 20 columns and 12 channels (one per electrode), obtaining an input tensor of shape N × 10 × 1 × 20 × 12 with which the ConvLSTM was fed. For each 4-fold cross-validation procedure, the network was trained for 150 epochs with batches of size 32, and the best models were saved based on the validation accuracy. The predominance of the rest class was taken into account by providing the network with class weights.
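A possible implementation of this pre-processing and windowing chain is sketched below; the helper name and the exact ordering of standardization and rectification are assumptions based on the description above.

```python
# Sketch of the pre-processing and windowing described above (assumed implementation).
import numpy as np

def make_windows(emg, mean, std, window=200, n_steps=10):
    """emg: (n_samples, 12) raw sEMG segment; mean/std fit on the training set."""
    x = np.abs((emg - mean) / std)           # standardize, then rectify (assumed order)
    n = x.shape[0] // window                 # non-overlapping windows of 200 samples
    x = x[: n * window].reshape(n, window, 12)
    # split each window into 10 subsequences of shape 1 x 20 x 12 for the ConvLSTM
    return x.reshape(n, n_steps, 1, window // n_steps, 12)
```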

2.3. Object Recognition and Segmentation

Object recognition and segmentation were performed with a Mask R-CNN network (He et al., 2017), utilizing the model released by Gregori et al. (2019). The model uses a ResNet-50-Feature Pyramid Network (He et al., 2016; Lin et al., 2017) as backbone. It is based on the implementation provided by Massa and Girshick (2018) that was originally trained on the Common Objects in Context (COCO) dataset (Lin et al., 2014) and fine-tuned on the MeganePro objects. Gregori et al. (2019) demonstrated the substantial increase in average precision of this model with respect to the non-fine-tuned model when tested on the MeganePro objects, thanks also to the limited variability and number of objects employed in the MeganePro acquisitions. To reduce the computation time, we extracted and stored the contour of the objects identified by this network only from 2 s before to 3.5 s after the beginning of the grasp identified from the relabeled data. Furthermore, as done in Gregori et al. (2019), instances having a score lower than 0.8 were discarded. At this stage, the gaze points within each video frame (at their original sampling rate) were used to identify the object being looked at by the subject. Similarly to what was done in Gregori et al. (2019), objects were considered looked at for grasping purposes when the gaze point was closer than 20 px to the object contour (evaluated with the Euclidean distance). This is done only for valid gaze-frame instances, which occur when the conditions of having a valid gaze-point estimation and at least one object recognized by the Mask R-CNN are both met (excluding the person and background "object" classes). The main advantage of this approach is the decoupling of object and grasp intention identification, where the information about the object fixated by the user is instantly available once the grasp intention is detected (Gregori et al., 2019).
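The gaze-to-object association rule described above could look roughly like the following sketch; the data structures (contours as pixel-coordinate arrays, labels and scores per detection) are assumptions about how the offline Mask R-CNN output is stored, not the authors' exact code.

```python
# Sketch of the gaze-to-object association rule described above (hypothetical helper).
import numpy as np

def looked_at_object(gaze_xy, detections, max_dist=20.0, min_score=0.8):
    """gaze_xy: (x, y) gaze point in pixels for one video frame.
    detections: list of dicts with 'label', 'score' and 'contour' (K x 2 array of
    pixel coordinates), extracted offline from the Mask R-CNN output."""
    best_label, best_dist = None, np.inf
    for det in detections:
        if det["score"] < min_score or det["label"] in ("person", "background"):
            continue
        d = np.linalg.norm(det["contour"] - np.asarray(gaze_xy), axis=1).min()
        if d <= max_dist and d < best_dist:
            best_label, best_dist = det["label"], d
    return best_label  # None when no object is close enough to the gaze point
```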

FIGURE 2 | Example of a typical unimodal and multimodal analysis process flow. The sEMG-based grasp-type classification (EMG in the figure) continuously identifies the grasp type, and its output is taken as is when a rest condition or no object is detected (marked in blue). Once a grasp intent is identified, the offline-computed visual information is loaded and, if a target object is successfully identified, the final grasp is selected only among the grasp types paired with the identified object (marked in green). The approach restarts as soon as a sample is classified as rest.

2.4. Multimodal Analysis

The multimodal analysis consists of fusing gaze, visual data and sEMG with the aim of increasing the robustness of grasp-type recognition (Figure 2). The multimodal analysis was performed on the data extracted with the Mask R-CNN, namely in the time frame of 2 s before and 3.5 s after the beginning of the grasp, maintaining the same 4-fold structure used for training the ConvLSTM models. The sEMG-based grasp-type classification was performed every 20 samples (approximately 10 ms) with the best models obtained in the unimodal sEMG-based grasp-type classification step (section 2.2). This served to both identify the beginning of a grasp and classify the grasp type based only on the sEMG. The identification of the grasp intent is obtained by leveraging the ability of the ConvLSTM to differentiate between rest and grasp, and the multimodal data fusion is triggered only after the recognition of a non-rest condition (i.e., when a grasp type is detected, regardless of its type). To perform the multimodal data fusion, the information stored during the object recognition and segmentation step (section 2.3) is loaded each time a new valid gaze-frame instance is available. Once a grasp intention is detected, the target object is identified as the last object being looked at in the previous 480 samples (approximately 250 ms). If no object is identified, the search continues until 500 ms after the grasp intention identification. In the case that an object is successfully identified, the grasp types paired with the recognized object are fused with the information provided by the sEMG-based classifier. In particular, the final grasp type is chosen as the one having the highest rate in the output vector of the ConvLSTM restricted to the grasp types paired with the recognized object. If no object is identified within this time-frame, the approach continues with only the sEMG-based grasp-type classification. In both cases (i.e., whether an object is identified or not), the approach restarts as soon as a sample is classified as rest. In the subsequent analysis, the subject identified as S114 was excluded from the multimodal analysis due to the strabismus condition that negatively influenced the quality of the eye tracking data (Cognolato et al., 2020).
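The decision rule used for the fusion can be summarized with a short sketch; the class indices, the mapping from objects to admissible grasp types, and the function name are illustrative assumptions, but the logic follows the description above.

```python
# Sketch of the late-fusion decision rule described above (assumed names).
import numpy as np

def fuse_prediction(emg_probs, target_object, grasps_per_object, rest_class=0):
    """emg_probs: length-11 softmax output of the ConvLSTM for the current window.
    grasps_per_object: dict mapping an object label to the grasp classes it is paired with."""
    pred = int(np.argmax(emg_probs))
    if pred == rest_class or target_object is None:
        return pred                      # rest or no identified object: sEMG-only output
    allowed = grasps_per_object[target_object]
    # pick the paired grasp type with the highest sEMG probability
    return max(allowed, key=lambda c: emg_probs[c])
```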
2.5. Statistical Analysis

Two non-parametric statistical tests were applied to evaluate the significance of the results: the Wilcoxon signed-rank test for paired samples and the Mann-Whitney test for independent samples. We applied the Wilcoxon signed-rank test to results obtained from the same population group (i.e., amputees or non-disabled subjects). This test validates the variations of using different approaches (e.g., sEMG-based vs. multimodal) or conditions (e.g., static vs. dynamic) on the same population data. The Mann-Whitney test was used to evaluate if the discrepancy in correct object identification between the two population groups (e.g., amputees vs. able-bodied subjects) was statistically significant. Non-parametric tests were chosen to cope with the non-normal data distribution. When comparing the sEMG-based and multimodal approaches with the Wilcoxon signed-rank test, we used the alternative hypothesis that the multimodal approach performs better than the unimodal. A null hypothesis of a difference between the conditions was used for the correct object identification comparison. The matched rank biserial correlation and the rank biserial correlation provided the effect size for the Wilcoxon signed-rank test and the Mann-Whitney test, respectively, and the values reported hereafter represent the magnitude of the effect size. The statistical analysis was applied to the results averaged per subject and was performed with JASP (JASP Team, 2020).
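For readers who prefer a scriptable alternative to JASP, equivalent tests are available in SciPy; this sketch only covers the p-values, while the rank biserial effect sizes reported in the paper would have to be computed separately.

```python
# Sketch of the statistical comparisons described above, run with SciPy instead of JASP.
from scipy.stats import wilcoxon, mannwhitneyu

def compare_approaches(unimodal_acc, multimodal_acc):
    # Paired, within-group test with the one-sided alternative that the
    # multimodal accuracies are higher than the unimodal ones.
    return wilcoxon(multimodal_acc, unimodal_acc, alternative="greater")

def compare_groups(amputee_rates, ablebodied_rates):
    # Unpaired, two-sided test for a difference in correct object
    # identification rate between the two populations.
    return mannwhitneyu(amputee_rates, ablebodied_rates, alternative="two-sided")
```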

3. RESULTS

The results show that exploiting eye-hand coordination (via the fusion of electromyography, gaze, and first-person video data) significantly increases the average classification accuracy for both intact subjects and amputees, suggesting that the robustness of hand prosthesis control based on grasp-type recognition can be improved significantly with the inclusion of visual information. The next sections present the results obtained with the unimodal sEMG-based grasp-type classification, the rate of correct target object identification, and the performance achieved with the multimodal approach. The accuracy was evaluated per fold, while the rate of correct object identification is evaluated per subject.

3.1. Unimodal sEMG-Based Grasp-Type Classification

The first step consisted in evaluating the performance of the ConvLSTM on sEMG-based grasp-type classification. The average classification accuracies obtained with the 4-fold cross-validation approach presented in section 2.2 on the 11 classes (ten grasp types and rest) are 72.51 ± 5.31% and 74.54 ± 4.96% for transradial amputees and able-bodied subjects, respectively.

3.2. Object Recognition via Eye-Hand Coordination Parameters

This section aims at evaluating whether the target object was correctly identified by using eye-hand coordination parameters. This is crucial to assess the actual performance of the approach, since the correct grasp type may still be retrieved even with an incorrect identification of the target object, given that multiple objects graspable with the same grasp type are always present in the scene. Therefore, this protocol-related aspect might introduce a bias that can boost the results. To evaluate this aspect, we considered an identification as correct when the object recognized by the multimodal approach at the beginning of the grasp corresponds to the target one. The average rate of correct object identification is higher than 85% for both transradial amputees (91.88 ± 6.80%) and able-bodied subjects (86.08 ± 11.00%) for the static condition, indicating that the correct object was recognized for the vast majority of the trials. These values slightly decrease for the dynamic condition, where the correct object was identified in 79.73 ± 15.54% of the trials for transradial amputees, and in 79.50 ± 12.14% for able-bodied subjects. The statistical analysis revealed that the condition (static or dynamic) has a statistically significant effect on the correct object recognition rate (p <0.001 and p = 0.001 for amputees and able-bodied subjects, respectively), while no significant difference is found between the groups (p = 0.051 and 0.441 for the static and dynamic conditions, respectively). Figure 3 shows the results for both groups and conditions, while the outcomes of the statistical analysis are summarized in Table 2.

FIGURE 3 | Object recognition rate via eye-hand coordination parameters for transradial amputees and able-bodied subjects in both static and dynamic conditions. The bar illustrates the mean value and the error bars the standard deviation. **p <0.01, and ***p <0.001.

TABLE 2 | Within-subject (left) and between-subjects (right) statistical description of the correct object recognition rate.

Within-subject (Wilcoxon test)
                        Amputees                 Able-bodied
                        p-value   Effect size    p-value   Effect size
Static - Dynamic        <0.001    0.943          0.001     0.699

Between-subjects (Mann-Whitney test)
                              Static                  Functional
                              p-value   Effect size   p-value   Effect size
Amputees - Able-bodied        0.051     0.371         0.441     0.148

The table on the left reports the statistical analysis performed with the Wilcoxon test to evaluate the effect of the static and dynamic conditions within the same population group. The evaluation of differences among amputees and able-bodied subjects for the two conditions obtained with the Mann-Whitney test is reported on the right.

3.3. Multimodal Analysis

This section reports the results obtained by applying the multimodal approach presented in section 2.4. To better investigate the possible condition-related differences, the results are reported for the static and dynamic condition separately. It is worth noticing that, in this section, the unimodal and multimodal approaches are evaluated on the same data (segmented as described in section 2.4), as, due to differences in data segmentation, the results of the multimodal analysis are not directly comparable to the ones described in the unimodal sEMG-based grasp-type classification section (section 3.1). In fact, while the unimodal approach previously presented was tested on entire repetitions, in this section the testing was performed on the data extracted in the object recognition and segmentation phase, namely from 2 s before to 3.5 s after the beginning of the grasp identified from the relabeled data.

3.3.1. Static Condition

The inclusion of gaze and visual information has led to a substantial increase in grasp-type classification accuracy for both the amputee and able-bodied groups for the static condition. For amputees in particular, the average classification accuracy obtained using only the sEMG modality was 63.03 ± 5.36%, while the multimodal approach reaches on average 78.64 ± 6.13%, with an average increment of 15.61 ± 4.22% (Figure 4A). This difference is found to be statistically significant (p <0.001, effect size = 1, Figure 4B), indicating that the multimodal approach significantly increases the grasp-type classification accuracy.

FIGURE 4 | Comparison of grasp-type classification accuracy obtained with sEMG-based and multimodal approaches in transradial amputees and able-bodied subjects for the static condition. The top and bottom of the box indicate the third and first quartiles, respectively. The central line reports the median, the whisker extension follows the original definition. Data falling outside 1.5 times the interquartile range are considered outliers and indicated with circles. *p <0.05, **p <0.01, and ***p <0.001. (A) Grasp-type classification in transradial amputees for the static condition. (B) Grasp-type classification in able-bodied subjects for the static condition.

TABLE 3 | Statistical analysis of the accuracy achieved with the unimodal and multimodal approaches for the static and dynamic conditions.

            Amputees                 Able-bodied
            p-value   Effect size    p-value   Effect size
Static      <0.001    1.000          <0.001    0.996
Dynamic     <0.001    1.000          <0.001    0.991

The analysis was performed with the Wilcoxon test with the alternative assumption that the results obtained with the unimodal approach were lower than the multimodal ones.

The same trend, even though to a reduced extent, was obtained for the able-bodied subjects, who achieve an average increment of 6.56 ± 2.89%: the approach based only on sEMG data reached an average accuracy of 73.35 ± 6.14%, while the multimodal analysis reached 79.92 ± 5.85% (Table 3). Also in this case, the increase in classification accuracy with the multimodal approach was found to be statistically significant (p <0.001, effect size = 0.996, Table 3). The confusion matrices for the grasp-type classification of the static condition are reported in the Supplementary Material section.

3.3.2. Dynamic Condition

The trend obtained for the static condition is also maintained for the dynamic one, where there is however an overall decrease in classification accuracy. Hand amputees reached 58.99 ± 6.26% and 74.12 ± 8.87% as average grasp-type classification accuracy for the sEMG-based and multimodal methods, respectively, with an average increment of 15.13 ± 6.32% (Figure 5A). The Wilcoxon test revealed the statistical significance of these results (p <0.001, effect size = 1, Table 3), which indicates that the multimodal approach leads to a significant increase in grasp-type classification accuracy in hand amputees also for the dynamic condition. Focusing on the able-bodied subjects, the inclusion of visual information led to an average increase of 7.37 ± 3.52%, with the sEMG-based grasp-type recognition and the multimodal approach reaching an average accuracy of 63.55 ± 5.23% and 70.92 ± 5.40%, respectively (Figure 5B). Also in this case, the increase in performance obtained with the multimodal approach is statistically significant (p <0.001, effect size = 0.991, Table 3). The confusion matrices for the grasp-type classification of the dynamic condition are reported in the Supplementary Material section.

4. DISCUSSION

This work shows that a multimodal approach based on eye-hand coordination (thus fusing sEMG, gaze, and visual data) can significantly improve the performance of a grasp-type classifier, particularly for trans-radial amputees, suggesting that the robustness of hand prosthesis control can be improved with the inclusion of visual information. Furthermore, the results indicate that the target object can be recognized with a high probability by relying on the eye-hand coordination parameters without placing any additional burden on the user.

Target object recognition is overall slightly higher in amputees, without significant difference between the two populations (Table 2, p = 0.051 and 0.441 for the static and dynamic conditions, respectively). This result suggests that an accurate identification of the target object can be accomplished: (1) by leveraging the ability of the classifier to discriminate between rest and grasp, serving as an intention detector, (2) by exploiting the timing of visuomotor coordination, and (3) by continuously tracking the objects that the user is looking at (as proposed in Gregori et al., 2019). The use of the classifier as both intent detector and grasp-type classifier eliminates the need for a two-step approach in which the grasp-type classification is performed after the identification of the movement onset, which can reduce the time window to identify the target object correctly. However, the use of an sEMG-based classifier as an intention detector still does not allow to exploit the entire time window for the identification of the target object, as a latency
between the start of a movement and the muscle activation exists and was shown to be significantly longer for amputees (Gregori et al., 2019). The inclusion of forearm kinematics [e.g., via inertial measurement units (IMUs)] can probably improve this aspect, even though a muscular activation clearly marks the user intention of moving the hand, while arm kinematics intrinsically have a higher degree of variability, particularly in unconstrained environments where the rest-grasp transition might not be clearly identifiable. Furthermore, the use of continuous gaze and object tracking to identify the object looked at by the user also eliminated the need to recognize the gaze fixation on the target object, which is not a trivial task and has the disadvantage of shortening the time window for the grasp identification. The main disadvantage of continuous gaze and object tracking, however, is the computational demand. On the other hand, tools such as Mask R-CNN and You Only Look Once (YOLO) (Redmon and Farhadi, 2018) can perform object detection and segmentation in real-time, which can be made more efficient by restricting the detection area to the surroundings of the gaze point.

The results show a gap of approximately 10% in the correct object identification between the static and dynamic conditions, which was found to be statistically significant. This difference may be due to the tasks that are performed under the two conditions. First, the increased number of objects placed in the scene for the dynamic condition, and therefore their spatial vicinity, can facilitate the selection of an object near the target one, resulting in incorrect identification. Second, the different gaze behavior shown in Gregori et al. (2019) for in-place, lifting, and displacement actions could also have influenced the correct object identification. For in-place actions (i.e., when the object is not moved), the participants only had to localize the target object in order to plan the motor action accordingly (i.e., defined as locating fixation by Land et al., 1999) (Land et al., 1999; Johansson et al., 2001; Land, 2006; Gregori et al., 2019). Instead, a series of activities was requested for the lifting and displacement actions (contained only in the dynamic condition), which commonly cause a gaze shift before the current action is completed, in order to plan the next step (Land et al., 1999; Johansson et al., 2001; Land, 2006; Gregori et al., 2019). Finally, no significant difference was found between the amputees and able-bodied subjects within each condition for what concerns the target object identification. This is consistent with the results from Gregori et al. (2019), where similar visuomotor strategies were found for the two groups.

FIGURE 5 | Comparison of grasp-type classification accuracy obtained with sEMG-based and multimodal approaches in transradial amputees and able-bodied subjects for the dynamic condition. See Figure 4 for information about boxplots' parameters. *p <0.05, **p <0.01, and ***p <0.001. (A) Transradial amputees. (B) Able-bodied subjects.

A comparison of the results for the unimodal sEMG-based grasp-type classification with the state of the art is not easy to perform due to the differences in data, protocols, and subjects. Although different for data segmentation, the closest investigation in terms of data and protocol is the validation given in Cognolato et al. (2020). The grasp-type classification accuracies achieved in this work with the ConvLSTM are in line with those in Cognolato et al. (2020). Although the Kernel Regularized Least Squares with a nonlinear exponential χ2 kernel and marginal Discrete Wavelet Transform features achieved better performance (Cognolato et al., 2020), the main advantages of using the ConvLSTM are the complete absence of the feature extraction step and the use of shorter time-windows. The difference in unimodal sEMG-based grasp-type classification accuracy between amputees and intact subjects is roughly 2%. This result is in line with the findings from Cognolato et al. (2020), where traditional and well-established machine learning approaches were employed. Another point that merits further analysis is the extent to which the displacement of a real object by the able-bodied subjects (which was not "materially" performed by the amputees) might have influenced the comparison between the two groups, as it is well known that changes in force level negatively influence the sEMG-based grasp-type classification (Campbell et al., 2020).

The inclusion of visual information substantially increased the average grasp-type classification accuracy. This trend is maintained for both the static and dynamic conditions, with an average accuracy gain of approximately 15% for hand amputees and roughly 7% for intact subjects, and it is found to be statistically significant for both population groups. These results reveal the benefit of merging gaze and visual information with
the traditional sEMG-based approach to improve the grasp-type classification, and it is in line with the findings of Gigli et al. (2018) and Gregori (2019). Although the different data segmentation makes the results not directly comparable, our results showed a stronger increase for both populations than the one reported in Gigli et al. (2018) and Gregori (2019). In addition to the different data segmentation, the different methods for performing the sEMG-based grasp-type recognition, for extracting the visual information and for performing the data fusion have likely contributed to this difference. On the other hand, both approaches indicate the benefit of including visual information for grasp-type recognition for both populations, with the amputees showing an increase roughly twice that of the able-bodied subjects. Further efforts should be put into analyzing the reasons behind the different increase in classification accuracy between the two populations. A possible hypothesis, to be verified in future works, is that the different increase of classification accuracy between the two populations might be due to the fact that the performance of the computer vision part of the pipeline (which is similar in the two groups) is capable of fully counterbalancing the low performance of sEMG up to a certain level of accuracy.

It should be noted that the approach proposed in this work performs a grasp-type classification based on the sEMG modality, and the information about the suitable grasp types for the target object is merged after the classification step. In addition, this information is used to select the class with the highest recognition rate among those paired with the target object. Therefore, it seems reasonable that this approach has more influence in cases of uncertainty between classes (i.e., when several classes are similarly likely to be correct) than when the classifier assigns a high probability to one of the classes (either correct or not) paired with the object. The approach can exclude grasp types with a likelihood similar to the correct one in the former scenario, thereby improving the recognition of the appropriate class, whereas it has no effect if a high probability is given to an incorrect class among the suitable ones. It is also worth noticing that both performance and improvements depend on the chosen classifier as well as on its ability to correctly identify a grasp type, as it seems reasonable that the benefit of including additional and complementary sources of information decreases as the performance of the unimodal classification increases. However, given that a greater improvement for amputees was also obtained in Gregori (2019) with a different approach, further investigations might enlighten the reason for this difference.

The proposed method indicates the viability of taking advantage of a natural human behavior to improve the grasp-type recognition by having the control system retrieve complementary information autonomously, without placing any additional burden on the user. Furthermore, the approach is mainly driven by the sEMG modality, which is the direct user-device interface, making the autonomous part of the method as unobtrusive as possible. This was done in an effort to limit the conscious and visual attention demands, which were deemed an aspect needing improvement by myoelectric prosthesis users (Atkins et al., 1996; Cordella et al., 2016).

Having the data analysis procedure fully based on deep neural networks can lead to the seamless integration of different modalities and to faster models, particularly at the testing phase (e.g., Ren et al., 2015). The first point represents a potential starting point for future work targeting multimodal data analysis employing multiple data acquisition techniques. The second one can lead to better real-time applications, obtained by reducing the number of separate processes that are required to achieve the same task.

Finally, the multimodal approach showed a small increment in misclassification toward the rest class. This increase of misclassifications might be a consequence of the "releasing" strategy employed in the approach, where a rest sample from the unimodal sEMG-based grasp-type classification marks the end of the prehension. In this case, the control returns to being purely sEMG-based, re-initializing the search for a new target object, which is improbable to be found for the dynamic condition, as the gaze has likely been moved to the next activity "step." A more robust identification of the prehension completion, for example by requiring a minimum number of consecutive rest samples instead of a single one, could improve this aspect. On the other hand, this would increase the delay between the user intent and prosthesis reaction, reducing the speed of the control in an online application.
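A minimal sketch of such a debounced release rule is given below; the threshold of consecutive rest windows is purely illustrative and would have to be traded off against the added reaction delay discussed above.

```python
# Sketch of a more conservative "release" rule: require several consecutive rest
# classifications before ending the prehension (hypothetical helper, not the paper's code).
def make_release_detector(min_rest_windows=5):
    state = {"rest_count": 0}
    def is_released(predicted_class, rest_class=0):
        state["rest_count"] = state["rest_count"] + 1 if predicted_class == rest_class else 0
        return state["rest_count"] >= min_rest_windows
    return is_released
```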

4.1. Limitations and Further Improvements

In order to achieve our objectives, we decided to limit additional uncertainties regarding the object recognition and the grasp types paired with it. However, in a real scenario, the object recognition accuracy is likely to be lower than the one achieved in this work with a network fine-tuned on the objects composing the acquisition setup (Gregori et al., 2019), and the grasp types suitable for a specific object are commonly not fully known a priori. Both aspects can influence the improvement achievable. A limitation of this work is that the computer vision pipeline was tuned on the same objects that were used in the acquisition protocol. This approach was applied to be consistent with the electromyography data analysis procedure. Nevertheless, object recognition accuracy in a real scenario is likely to be lower than the one achieved in our work. A perspective of object recognition for grasping without fine-tuning can be found in Gigli et al. (2018), showing that multimodal fusion increases the classification success rate considerably even if fine-tuning is not performed. An alternative perspective of object recognition for grasping by exploiting dedicated training datasets is provided by Ghazaei et al. (2017). Despite the fact that in a real scenario object recognition accuracy is likely to be lower than the one achieved in our work, we expect this aspect to improve in the future thanks to new resources which are being developed, increasing the applicability to real-life applications. In fact, the presented work is one of the first approaches exploring what can be done by fusing electromyography, computer vision and eye tracking data using deep learning approaches to mimic human eye-hand coordination. At this moment, real-life applications would at least benefit from fine-tuning models on dedicated datasets (Ghazaei et al., 2017), bringing their performance at least closer to the ones described in this paper, and from newer computer vision architectures. In addition, real-life applications of this system in products will most likely require years, during which performance in computer vision will probably continue to advance, with dedicated architectures, leading to better models for grasp classification too, even without or with limited fine-tuning. On the other hand, information on the suitable grasp types can be extracted from the first-person video without the need of recognizing the object, for example by evaluating the object's characteristics (e.g., shape, size) with computer vision approaches (Došen et al., 2010; Hao et al., 2013; Markovic et al., 2015), deep learning techniques (Redmon and Angelova, 2015; Ghazaei et al., 2017; Gigli et al., 2018; Taverne et al., 2019), or by evaluating its affordances (Nguyen et al., 2016). Moreover, the position of the gaze on the object can also help to discriminate among multiple affordances, as objects can commonly be grabbed with several grasp types (e.g., if the gaze point is on the bottle cap, it is more likely that the user is planning to open the bottle, thus suggesting the use of a tripod grasp). Considering the unimodal sEMG-based grasp-type classification, although the chosen network achieved results in line with the ones shown for the dataset validation (Cognolato et al., 2020), other networks and architectures might further improve the performance, which may also limit the potential benefit of including complementary information.

An additional point concerns the data fusion, because when the approach fails to detect the correct object an incorrect grasp type is likely to be chosen. A further refinement may select the final grasp type by taking into account the confidence of the recognition from both modalities, weighting the final selection toward the most promising one. Considering that rest is equally classified in the unimodal and multimodal analysis, fully including it might influence the classification performance, possibly reducing the difference in performance between the two approaches. The need to wear an eye tracker might affect the usability of the setup. On the other hand, novel devices similar to normal eyeglasses are now on the market and it is plausible to think of future improvements that can make this technology even less obtrusive, for example by integrating it into standard glasses or even into contact lenses (Sako et al., 2016; Pupil Labs, 2020).

Finally, to validate the viability and performance of the approach, it should be implemented and tested in a real-time fashion with transradial amputees, possibly during the execution of ADL in unconstrained environments.

5. CONCLUSION

The aim of this work was to investigate if a multimodal approach leveraging a natural human behavior (i.e., eye-hand coordination) can improve the classification of several grasp types for hand prosthesis control. The results are encouraging, showing that the fusion of electromyography, gaze, and first-person video data increases the offline grasp-type classification performance of transradial amputees. We used the publicly available MeganePro Dataset 1, containing sEMG, gaze, and first-person video data collected from 15 transradial amputees and 30 able-bodied subjects performing grasping tasks on household objects in static and dynamic conditions. A deep neural network architecture based on a ConvLSTM performs the unimodal sEMG-based grasp-type classification, while the object recognition and segmentation are executed with a Mask R-CNN. A grasp-type classification based on the sEMG is continuously performed, allowing the identification of grasp intents by leveraging the ability of the network to distinguish between the resting and grasping conditions. The identification of a grasp intent triggers the search for the target object based on eye-hand coordination parameters and, in the case of an object being identified, the grasp type is selected among the suitable ones for the recognized object. Otherwise, the approach continues the grasp-type classification relying only on the sEMG modality. The results show that the multimodal approach significantly increases the performance in transradial amputees and able-bodied subjects. In both the static and dynamic conditions, the performance increment obtained with the multimodal approach allowed the grasp-type classification accuracy in transradial amputees to be comparable with the one obtained in able-bodied subjects, without placing additional control burden on the user. The results therefore show the benefit of a multimodal grasp-type classification and suggest the usefulness of the approach based on eye-hand coordination. Moreover, the availability of the dataset allows for further investigations and improvements, which are desirable to obtain an approach that can be tested in online applications.

DATA AVAILABILITY STATEMENT

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found here: doi: 10.7910/DVN/1Z3IOM.

ETHICS STATEMENT

The experiment was designed and conducted in accordance with the principles expressed in the Declaration of Helsinki. Ethical approval for our study was requested to and approved by the Ethics Commission of the canton of Valais in Switzerland (CCVEM 010/11) and by the Ethics Commission of the Province of Padova in Italy (NRC AOP1010, CESC 4078/AO/17). Prior to the experiment, each subject was given a detailed written and oral explanation of the experimental setup and protocol. They were then required to give informed consent to participate in the research study.

AUTHOR CONTRIBUTIONS

MC contributed to the design of the multimodal approach, performed the data analysis, and wrote the manuscript. MA contributed to the design of the multimodal approach, to the ideation of the data analysis procedure, and revised the manuscript. RG contributed to the design of the multimodal approach and revised the manuscript. HM contributed to the design of the multimodal approach, to the conception of the data analysis procedure, and revised the manuscript. All authors
contributed to manuscript revision, read and approved the submitted version.

FUNDING

This work was partially supported by the Swiss National Science Foundation Sinergia project #160837 MeganePro.

ACKNOWLEDGMENTS

The authors would like to thank Prof. Giuseppe Marcolin for the assistance in statistical analysis, Drs. Yashin Dicente Cid, Vincent Andrearczyk, and Arjan Gijsberts for the insightful discussions, and the MeganePro consortium consisting of Peter Brugger, Gianluca Saetta, Katia Giacomino, Anne-Gabrielle Mittaz Hager, Diego Faccio, Cesare Tiengo, Franco Bassetto, Valentina Gregori, Arjan Gijsberts, and Barbara Caputo.

SUPPLEMENTARY MATERIAL

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/frai.2021.744476/full#supplementary-material

Conflict of Interest: The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's Note: All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Copyright © 2022 Cognolato, Atzori, Gassert and Müller. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
