Quantifying the Effects of Ground Truth Annotation Quality on Object Detection and Instance Segmentation Performance
ABSTRACT Fully-supervised object detection and instance segmentation models have accomplished notable results on large-scale computer vision benchmark datasets. However, the performance of fully-supervised machine learning algorithms depends heavily on the quality of the training data. Preparing computer vision datasets for object detection and instance segmentation is a labor-intensive task requiring each instance in an image to be annotated. In practice, this often results in the quality of bounding box and polygon mask annotations being suboptimal. This paper empirically quantifies the relationship between ground truth annotation quality and COCO's mean average precision (mAP) performance by introducing two separate noise measures, uniform and radial, into the ground truth bounding box and polygon mask annotations of the COCO and Cityscapes datasets. Mask-RCNN models are trained at each level of the noise measures to investigate the resulting performance. The results show a degradation of mAP as the level of both noise measures increases. For object detection and instance segmentation respectively, the highest level of noise resulted in mAP degradations of 0.185 and 0.208 for uniform noise, with reductions of 0.118 and 0.064 for radial noise, on the COCO dataset. For the Cityscapes dataset, reductions in mAP performance of 0.147 and 0.142 for uniform noise and 0.101 and 0.033 for radial noise were recorded. Furthermore, a decrease in average precision is seen across all classes, with the exception of the motorcycle class. The reductions vary between classes, indicating that the effects of annotation uncertainty are class-dependent.

INDEX TERMS Annotation uncertainty, computer vision, instance segmentation, object detection, supervised learning.
We believe another key factor for deep learning's recent success is the availability and facility of deep learning frameworks such as TensorFlow [9], PyTorch [10] and Apache MXNet [11], which enabled deep learning to become more accessible to the broader research community. The recent advancements in deep learning methodologies for computer vision tasks have yielded momentous technologies in many domains such as intelligent transportation systems [12], [13], sports analytics [14], [15] and medical imaging [16], [17]. These advancements are not restricted to RGB imagery, with advances in infrared [18] and hyperspectral imagery [19].

Neural networks' performance for supervised computer vision tasks relies on the data they are trained on. This includes the annotations that are utilized as ground truth for supervised learning algorithms. Sun et al. found that performance on vision tasks improves logarithmically as the training dataset size increases [7]. Due to the large quantity of data that is regularly needed, the process of annotating datasets for computer vision-supervised learning tasks is time intensive. For example, it required approximately 60,000 worker hours to annotate the Common Objects in Context (COCO) dataset [20]. For object detection, bounding boxes must be manually annotated over the classes of interest for the entire dataset. Employing a crowd-sourcing method [21] that is optimized for bounding box annotation, each annotation in the ImageNet Visual Recognition dataset [22] took around 35s to annotate. For instance segmentation, a polygon mask must be outlined around each class of interest for the dataset. Polygon annotations are more accurate than bounding boxes but are also more laborious. This is reflected by the annotation time estimated to be 79.2s per polygon mask for the popular COCO dataset [20].

The importance of ground truth annotation quality has been acknowledged for computer vision tasks in the literature, with methods being developed attempting to rectify and account for noisy labels in computer vision tasks such as object detection [23], [24] and image classification [25], [26], [27], [28]. To the authors' knowledge, there is limited literature attempting to quantify the effects the ground truth bounding box and polygon mask annotation quality have on object detection and instance segmentation performance.

The main contribution of this paper is to quantify empirically the annotation quality levels and their effects on mAP [29] by introducing noise into both bounding boxes and polygon masks on a subset of the COCO dataset. To the authors' knowledge, this work is the first to investigate annotation uncertainty for three different aspects. To begin with, this work is the first to investigate annotation uncertainty for instance segmentation. Secondly, for object detection, this work introduces noise at the scale of pixel distance, allowing a finer scale of annotation uncertainty which may be more representative of annotation uncertainty seen in practice. Finally, the effects of annotation uncertainty on each individual class' average precision (AP) performance are investigated to provide further insight. Quantifying the relationship between annotation quality and performance will yield helpful insight into the trade-off between annotation quality and the time and cost associated with such annotation quality. This information in turn will allow for informed decision-making and enables the tailoring of annotations to the use case of the application.

The paper is structured as follows. In Section II an overview of related work is discussed. Then, in Section III, an explanation for how annotation uncertainty is modeled for this work is given. This is followed by a description of the experiment in Section IV and a presentation of experimental results in Section V. In Section VI, these results are analyzed and discussed. Lastly, in Section VII, the conclusions of this work are summarised.

II. RELATED WORK
Taran et al. used Cityscapes [31] fine and coarse annotated images to investigate the effects ground truth annotation quality has on semantic image segmentation performance of traffic conditions [30]. The authors explored two situations, firstly using the fine ground truth annotations for both training and inferencing; secondly training with the fine ground truth annotations but inferencing on the coarse ground truth annotations. PSPNet [32] was used for the semantic segmentation model and a subset of the Cityscapes dataset was used for the analysis, which included data from 3 cities and the following classes; road, car, pedestrian, traffic lights, and signs. Using mean intersection over union (IoU) as the metric of interest, the authors found the IoU values for coarse ground truth annotated images in general were higher than those for fine ground truth annotated images. In light of the results of comparing fine and coarse ground truth annotations, the authors suggest that deep neural networks could be utilized to generate coarse ground truth annotated datasets, that can be modified and used to fine-tune the pre-trained models for the specific application.

A study by Mullen Jr et al. [33] compared annotation types and their effects on object detection performance on the Overhead Imagery Research Dataset (OIRDS) [34]. Three annotation types were considered for the analysis to detect cars from the OIRDS; polygon masks, bounding boxes, and target centroids. A modified version of the Overfeat [35] network architecture was used for the analysis. A Receiver Operating Characteristic (ROC) curve assessed at all pixel locations along with the area under the curve (AUC) was calculated for the 3 annotation types. The results showed polygon mask annotations scored marginally better AUC than the other two annotation types. The authors concluded that when putting together a dataset for deep learning, comparing annotation types is a key step, as the cost of annotations and the advantages and disadvantages of each annotation type should be considered.

Xu et al. investigated training object detectors with noisy labels [23], including incorrect class labels and imprecise bounding boxes on both PASCAL VOC 2012 [36] and COCO
B. TRAINING SETUP
The MMdetection framework [46] was used to train Mask-RCNN models with a ResNet-50 backbone for the experiments [3], [47]. One advantage of this model is its ability to output both object detection and instance segmentation results [3]. All experiments were conducted on a single workstation with an NVIDIA GeForce RTX 3060 GPU card with CUDA 11.6. For the experiments, the training and validation annotations contained the relevant induced noise. The test dataset remained with the original annotations and has not been tampered with. The Mask-RCNN models were trained from scratch for 73 epochs, which took approximately 120 hours per model to train for the COCO dataset and 12 hours to train for Cityscapes. A batch size of 2 images was utilized, due to hardware constraints, with a stochastic gradient descent (SGD) optimizer using a learning rate of 0.02, a momentum of 0.9, and a weight decay of 0.0001. A learning rate scheduler was utilized to drop the learning rate by a factor of 10 at training epochs 65 and 71. Evidence of over-fitting was apparent after epoch 66 when training on the ground truth annotations, as seen in Fig. 7 for the COCO dataset. As for Cityscapes, evidence of over-fitting was apparent after epoch 65. The model weights from epochs 66 and 65 were used for inferencing on the respective test datasets.
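As an illustration of this setup, the hyperparameters above might be expressed in an MMDetection 2.x-style configuration roughly as follows; the base config path and annotation file names are assumptions for the sketch, not the exact configuration used for these experiments.

```python
# Sketch of an MMDetection 2.x-style config reflecting the training setup described above.
# Base config path and annotation file names are illustrative assumptions.
_base_ = ['configs/mask_rcnn/mask_rcnn_r50_fpn_1x_coco.py']

data = dict(
    samples_per_gpu=2,  # batch size of 2 images, due to hardware constraints
    train=dict(ann_file='annotations/instances_train_noise.json'),  # noise-induced annotations
    val=dict(ann_file='annotations/instances_val_noise.json'),
    test=dict(ann_file='annotations/instances_test.json'))          # untouched ground truth

optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
lr_config = dict(policy='step', step=[65, 71])      # drop the learning rate by 10x at epochs 65 and 71
runner = dict(type='EpochBasedRunner', max_epochs=73)
load_from = None                                    # no pre-trained detector checkpoint is loaded
```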
C. METRICS
COCO’s defined mean average precision [29] (mAP) is
the primary metric of interest. When considering individual
classes, average precision (AP) can be and is used in place
of mAP. A breakdown of mAP and the individual classes’
AP results will be reported to provide further insight into
annotation quality and performance. COCO’s definitions for
mean average precision for small, medium, and large objects are denoted by mAPs, mAPm, and mAPl, whereas the mean average precisions requiring IoU thresholds of 0.5 and 0.75 are denoted by mAP0.5 and mAP0.75. mAP0.50:0.05:0.95 denotes the COCO primary challenge metric [29].
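These components can be computed with the pycocotools evaluator that accompanies the COCO detection evaluation metrics [29]; a minimal sketch is given below, with placeholder file names for the untouched test annotations and the model's predictions.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval

# Placeholder file names: original (noise-free) test annotations and the model's predictions.
coco_gt = COCO('instances_test.json')
coco_dt = coco_gt.loadRes('predictions.json')

for iou_type in ('bbox', 'segm'):   # object detection, then instance segmentation
    evaluator = COCOeval(coco_gt, coco_dt, iouType=iou_type)
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()           # reports mAP0.50:0.05:0.95, mAP0.5, mAP0.75, mAPs, mAPm, mAPl
```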
Whereas comparing the reduction in mAP scores provides insight into how the individual components of mAP degraded, this would not yield an appropriate comparison between object sizes or classes. For example, if the initial mAPs score using the original annotations is 0.1, the most mAPs could degrade is by 0.1, its initial value. To put this into perspective, if mAPl were 0.5 using the original annotations and were to degrade by 0.15 when using noise-induced annotations, this would result in mAPl = 0.35. Looking only at these differences, mAPs would appear to have degraded less, even though no small objects would be detected at all.
To provide further insight, linear regression models were fitted to the individual components of mAP scores for each level of noise, in an attempt to provide a standardized comparison between object classes and sizes that relates directly to mAP. The linear regression models were fitted on a single variable, induced noise level, which in turn gives the interpretation of the β coefficient; for a one-unit increase in induced noise level, on average the mAP score will change by β.
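A minimal sketch of such a fit, using ordinary least squares from statsmodels, is shown below; the mAP values are illustrative placeholders, not measured results from this work.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
noise_level = np.arange(0, 11)                            # 0 = ground truth, 1-10 = induced noise levels
map_scores = 0.37 - 0.02 * noise_level + rng.normal(0, 0.005, noise_level.size)  # placeholder values only

X = sm.add_constant(noise_level)                          # intercept (constant) plus the single noise variable
fit = sm.OLS(map_scores, X).fit()

beta = fit.params[1]                                      # average change in mAP per one-unit increase in noise
ci_low, ci_high = fit.conf_int(alpha=0.05)[1]             # 95% confidence interval for beta
print(f'beta = {beta:.4f} [{ci_low:.3f}, {ci_high:.3f}], adj. R2 = {fit.rsquared_adj:.3f}')
```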
V. RESULTS
The results were obtained from the test set of 1,775 images of the COCO dataset and 500 test images for the Cityscapes dataset. The approximate uniform buffered pixel distances used in this experiment range from 1 to 10 inclusive, with the radial induced noise ranging from σ = 1 to 5. The ground truth annotations were also used to train a baseline model. In the figures to follow in this section, a noise level of 0 refers to the ground truth annotations. Mask R-CNN models were trained with both noise-induced annotations along with ground truth annotations. For both the COCO and Cityscapes datasets, 10 datasets, one for each level of approximate uniform pixel distance, were used to train the models. For radial noise, 5 datasets, one for each level of the standard deviation of radial noise, were used to train the models.
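The buffering procedure itself is defined in Section III (not reproduced here); as a rough illustration, a uniform pixel-distance buffer can be applied to a polygon annotation with Shapely [44] along the following lines, although the exact procedure used to generate the noise-induced datasets may differ.

```python
from shapely.geometry import Polygon

def buffer_polygon(coords, pixel_distance):
    """Offset every edge of a polygon mask by a fixed pixel distance (illustrative sketch)."""
    buffered = Polygon(coords).buffer(pixel_distance)
    return list(buffered.exterior.coords)

# Hypothetical triangular mask, dilated by a 3-pixel buffer.
noisy_mask = buffer_polygon([(10, 10), (60, 12), (35, 55)], pixel_distance=3)
```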
For the fitted linear regression models in this section, β refers to the coefficient of the pixel distance buffering size variable for the uniform noise models, whereas for the radial noise models β refers to the coefficient of the σ variable. A 95% confidence interval is given in square brackets for the constant and β coefficients. Due to the saturation of mAPs results after noise level 5 for the uniform models, two linear regression models were used for mAPs: the first was fit on pixel distance buffering size values 0 to 5 inclusive and the second on values 6 to 10.

A. OBJECT DETECTION
The results of the experiments for object detection are outlined in this section. In Fig. 9 and Fig. 10 a plot of the individual components of mAP against pixel distance buffering size for the uniform noise and σ for radial noise is given for the COCO and Cityscapes datasets respectively. Linear regression models were used to model the relationship between induced noise measures and mAP. The results of the models are presented in Table 2 for the COCO dataset and Table 3 for the Cityscapes dataset. In Fig. 11 and Fig. 12 a plot of the per-class AP0.50:0.05:0.95 against pixel distance buffering size for the uniform noise and σ for radial noise is given for the COCO and Cityscapes datasets respectively. The plots are separated by the majority object size for the class in the test dataset. This was utilized for ease of readability.

FIGURE 9. COCO dataset object detection mAP results.
FIGURE 10. Cityscapes dataset object detection mAP results.
TABLE 2. COCO dataset linear regression model results for object detection.
TABLE 4. COCO dataset linear regression per-class model results for object detection.
B. INSTANCE SEGMENTATION
The results of the experiments for instance segmentation are
outlined in this section. In Fig. 13 and Fig. 14 a plot of
the individual components of mAP against pixel distance
buffering size for the uniform noise and σ for radial noise
is given for the COCO and Cityscapes datasets respectively.
Linear regression models were used to model the relationship
between induced noise measures and mAP. The results of the
models are presented in Table 6 for the COCO dataset and
Table 7 for the Cityscapes dataset. In Fig. 15 and Fig. 16
a plot of the per-class AP0.50:0.05:0.95 against pixel distance buffering size for the uniform noise and σ for radial noise is given for the COCO and Cityscapes datasets respectively. The plots are separated by the majority object size for the class in the test dataset. This was utilized for ease of readability. Linear regression models were used to model the relationship between induced noise measures and per-class AP0.50:0.05:0.95. The results of these models are presented in Table 8 and Table 9.

VI. DISCUSSION
The results enable us to compare the mAP performance for object detection and instance segmentation for various ground truth annotation qualities. For object detection,
TABLE 9. Cityscapes dataset linear regression per-class model results for instance segmentation.

For radial noise, the adjusted R2 is 0.974 for object detection and 0.954 for instance segmentation. The β coefficient from the linear regression models yields insight into quantifying the performance degradation. For a one-unit increase in pixel distance buffer size, on average the mAP0.50:0.05:0.95 changes by -0.0195 [-0.022, -0.017] for object detection and -0.0221 [-0.025, -0.019] for instance segmentation. For a one-unit increase in σ for radial noise, on average the mAP0.50:0.05:0.95 changes by -0.0241 [-0.029, -0.019] for object detection and -0.0135 [-0.017, -0.010] for instance segmentation.

A reduction across mAP for object detection and instance segmentation when introducing annotation uncertainty is no surprise. As supervised-learning neural network performance relies on the quality of the annotations, a degradation in annotation quality will be reflected in a reduction of the mAP. This work set out to investigate the relationship between annotation quality and mAP performance. For both types of induced noise used in this research, the noise in the training data has forced the model to include some noise around the objects of interest when inferencing, thus reducing the IoU with the ground truth annotation on the test set and resulting in a reduction in mAP. This noise during inferencing was more apparent for the uniform noise models. However, the model predictions still allow for the identification and localization of the objects of interest.

A direct comparison between uniform and radially-induced noise may not be fair, due to the nature of the induced noise for each. The uniform noise significantly degrades every vertex for instance segmentation, whereas the radial noise was normally distributed and centred around 0, meaning some vertices would only experience a marginal degradation. All things considered, the results show that as the degradation of the annotation increases, a reduction in mAP performance is observed. This reflects the need for accurate annotations for supervised learning computer vision tasks.

Radial noise degradation for instance segmentation is lower than for object detection. An explanation for this may lie in how the radial noise was implemented. Bounding boxes require only four normally distributed values to introduce the noise: the first two update the x and y coordinates of the bounding box starting position, the third updates the width, and the fourth updates the height. Due to the nature of the normal distribution, there is a possibility of relatively large values being introduced, which in turn would significantly degrade the bounding box annotation.
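A minimal sketch of the bounding-box case described above is given below, assuming COCO-format [x, y, width, height] boxes; clipping to the image bounds and the polygon-mask case are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def radial_noise_bbox(bbox, sigma):
    """Perturb a COCO-format [x, y, width, height] box with four N(0, sigma^2) draws:
    two for the top-left corner, one for the width, and one for the height."""
    dx, dy, dw, dh = rng.normal(0.0, sigma, size=4)
    x, y, w, h = bbox
    return [x + dx, y + dy, w + dw, h + dh]

# Hypothetical box perturbed with radial noise of standard deviation 3.
noisy_box = radial_noise_bbox([100.0, 50.0, 40.0, 30.0], sigma=3)
```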
The findings of this study have to be seen in light of some limitations. To the authors' knowledge, no prior work has modelled annotation uncertainty for object detection or instance segmentation datasets. In light of this, uniform noise and normally distributed radial noise were selected to model annotation uncertainty. This work allows us to quantify the degradation of mAP with respect to modelled annotation uncertainty to better understand the relationship between annotation quality and performance.

VII. CONCLUSION AND FUTURE WORK
In this paper, the relationship between object detection and instance segmentation annotation quality and mAP performance is studied. The observed results were attained by a Mask-RCNN model with a ResNet-50 backbone on a subset of the COCO 2017 challenge and Cityscapes datasets. The ground truth annotations for both bounding boxes and polygon masks had two separate types of noise introduced: uniform and radial.

For object detection and instance segmentation, both types of induced noise negatively affected the mAP. When investigating the per-class AP0.50:0.05:0.95 performance, a reduction was seen in all classes used in the experiments except motorcycle, with the reductions varying between classes. This suggests that the effect of annotation quality on AP0.50:0.05:0.95 performance is class-dependent. A strong linear relationship was observed between both noise types and mAP for the COCO dataset. An adjusted R2 of 0.978 for uniform noise and 0.974 for radial noise was recorded for object detection, with instance segmentation recording an adjusted R2 of 0.956 for uniform noise and 0.954 for radial noise when using mAP0.50:0.05:0.95. For radially-induced noise for instance segmentation, there is some robustness at σ = 1, as the degradation is less than 2% for all components of mAP. While the required accuracy of mask predictions for instance segmentation is application dependent, this work has quantified the degradation in mAP for varying annotation qualities to help inform decisions on annotation labelling quality and the expected degradation.

This study has empirically quantified the relationship between annotation quality and mAP when introducing two
APPENDIX A
COCO DATASET
REFERENCES
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 25, 2012, pp. 1106-1114.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481-2495, Dec. 2017.
[3] K. He, G. Gkioxari, P. Dollar, and R. Girshick, "Mask R-CNN," in Proc. ICCV, 2017, pp. 2961-2969.
[4] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 779-788.
[5] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, "Transformers in vision: A survey," ACM Comput. Surv., vol. 54, no. 10, pp. 1-41, 2022.
[6] P. Cao, Z. Zhu, Z. Wang, Y. Zhu, and Q. Niu, "Applications of graph convolutional networks in computer vision," Neural Comput. Appl., vol. 34, no. 16, pp. 13387-13405, 2022.
[7] C. Sun, A. Shrivastava, S. Singh, and A. Gupta, "Revisiting unreasonable effectiveness of data in deep learning era," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 843-852.
[8] D. Karimi, H. Dou, S. K. Warfield, and A. Gholipour, "Deep learning with noisy labels: Exploring techniques and remedies in medical image analysis," Med. Image Anal., vol. 65, Oct. 2020, Art. no. 101759.
[9] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, and S. Ghemawat, "TensorFlow: Large-scale machine learning on heterogeneous distributed systems," 2016, arXiv:1603.04467.
[10] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, and A. Desmaison, "PyTorch: An imperative style, high-performance deep learning library," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 32, 2019, pp. 1-12.
[11] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," 2015, arXiv:1512.01274.
[12] T. Liang, H. Bao, W. Pan, and F. Pan, "Traffic sign detection via improved sparse R-CNN for autonomous vehicles," J. Adv. Transp., vol. 2022, pp. 1-16, Mar. 2022.
[13] C. Eising, J. Horgan, and S. Yogamani, "Near-field perception for low-speed vehicle automation using surround-view fisheye cameras," IEEE Trans. Intell. Transp. Syst., vol. 23, no. 9, pp. 13976-13993, Sep. 2022.
[14] R. Zhang, L. Wu, Y. Yang, W. Wu, Y. Chen, and M. Xu, "Multi-camera multi-player tracking with deep player identification in sports video," Pattern Recognit., vol. 102, Jun. 2020, Art. no. 107260.
[15] G. Thomas, R. Gade, T. B. Moeslund, P. Carr, and A. Hilton, "Computer vision for sports: Current applications and research topics," Comput. Vis. Image Understand., vol. 159, pp. 3-18, Jun. 2017.
[16] A. Esteva, K. Chou, S. Yeung, N. Naik, A. Madani, A. Mottaghi, Y. Liu, E. Topol, J. Dean, and R. Socher, "Deep learning-enabled medical computer vision," Npj Digit. Med., vol. 4, no. 1, pp. 1-9, Jan. 2021.
[17] J. Gao, Y. Yang, P. Lin, and D. S. Park, "Computer vision in healthcare applications," J. Healthcare Eng., vol. 2018, Mar. 2018, Art. no. 5157020.
[18] R. Zhang, L. Xu, Z. Yu, Y. Shi, C. Mu, and M. Xu, "Deep-IRTarget: An automatic target detector in infrared imagery using dual-domain feature extraction and allocation," IEEE Trans. Multimedia, vol. 24, pp. 1735-1749, 2022.
[19] X. Yang, Y. Ye, X. Li, R. Y. K. Lau, X. Zhang, and X. Huang, "Hyperspectral image classification with deep learning models," IEEE Trans. Geosci. Remote Sens., vol. 56, no. 9, pp. 5408-5423, Sep. 2018.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2014, pp. 740-755.
[21] Y. Hu, Z. Ou, X. Xu, and M. Song, "A crowdsourcing repeated annotations system for visual object detection," in Proc. 3rd Int. Conf. Vis., Image Signal Process., Aug. 2019.
[22] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, and A. C. Berg, "ImageNet large scale visual recognition challenge," Int. J. Comput. Vis., vol. 115, no. 3, pp. 211-252, Dec. 2015.
[23] Y. Xu, L. Zhu, Y. Yang, and F. Wu, "Training robust object detectors from noisy category labels and imprecise bounding boxes," IEEE Trans. Image Process., vol. 30, pp. 5782-5792, 2021.
[24] H. Li, Z. Wu, C. Zhu, C. Xiong, R. Socher, and L. S. Davis, "Learning from noisy anchors for one-stage object detection," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2020, pp. 10588-10597.
[25] Z. Zhang and M. Sabuncu, "Generalized cross entropy loss for training deep neural networks with noisy labels," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 31, 2018, pp. 1-11.
[26] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, "MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels," in Proc. Int. Conf. Mach. Learn., Jul. 2018, pp. 2304-2313.
[27] M. Ren, W. Zeng, B. Yang, and R. Urtasun, "Learning to reweight examples for robust deep learning," in Proc. Int. Conf. Mach. Learn., Jul. 2018, pp. 4334-4343.
[28] J. Shu, Q. Xie, L. Yi, Q. Zhao, S. Zhou, Z. Xu, and D. Meng, "Meta-weight-net: Learning an explicit mapping for sample weighting," in Proc. Adv. Neural Inf. Process. Syst. (NIPS), vol. 32, 2019, pp. 1-12.
[29] Microsoft COCO. (2021). Detection Evaluation Metrics. Accessed: Oct. 18, 2022. [Online]. Available: https://cocodataset.org/#detection-eval
[30] V. Taran, Y. Gordienko, A. Rokovyi, O. Alienin, and S. Stirenko, "Impact of ground truth annotation quality on performance of semantic image segmentation of traffic conditions," in Advances in Computer Science for Engineering and Education II, Z. Hu, S. Petoukhov, I. Dychka, and M. He, Eds. Cham, Switzerland: Springer, 2020, pp. 183-193.
[31] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Las Vegas, NV, USA, Jun. 2016, pp. 3213-3223.
[32] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2881-2890.
[33] J. F. Mullen, F. R. Tanner, and P. A. Sallee, "Comparing the effects of annotation type on machine learning detection performance," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. Workshops (CVPRW), Jun. 2019.
[34] F. Tanner, B. Colder, C. Pullen, D. Heagy, M. Eppolito, V. Carlan, C. Oertel, and P. Sallee, "Overhead imagery research data set-an annotated data library & tools to aid in the development of computer vision algorithms," in Proc. IEEE Appl. Imag. Pattern Recognit. Workshop (AIPR), Oct. 2009, pp. 1-8.
[35] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," 2013, arXiv:1312.6229.
[36] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The Pascal visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303-338, Sep. 2009.
[37] D. Acuna, A. Kar, and S. Fidler, "Devil is in the edges: Learning semantic boundaries from noisy annotations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2019, pp. 11075-11083.
[38] Z. Yu, C. Feng, M.-Y. Liu, and S. Ramalingam, "CASENet: Deep category-aware semantic edge detection," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., Venice, Italy, Jul. 2017, pp. 5964-5973.
[39] B. Hariharan, P. Arbelaez, L. Bourdev, S. Maji, and J. Malik, "Semantic contours from inverse detectors," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 991-998.
[40] D. Rolnick, A. Veit, S. Belongie, and N. Shavit, "Deep learning is robust to massive label noise," 2017, arXiv:1705.10694.
[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2009, pp. 248-255.
[42] Y. LeCun. (1998). The MNIST Database of Handwritten Digits. [Online]. Available: http://yann.lecun.com/exdb/mnist/
[43] A. Krizhevsky, "Learning multiple layers of features from tiny images," Univ. Toronto, Toronto, ON, Canada, Tech. Rep. TR-2009, 2009.
[44] S. Gillies. (2013). The Shapely User Manual. Accessed: Oct. 18, 2022. [Online]. Available: https://pypi.org/project/Shapely
[45] B. E. Moore and J. J. Corso. (2020). Fiftyone. [Online]. Available: https://github.com/voxel51/fiftyone
[46] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, and Z. Zhang, "MMDetection: Open MMLab detection toolbox and benchmark," 2019, arXiv:1906.07155.
[47] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jul. 2016, pp. 770-778.

CATHAOIR AGNEW received the B.S. degree in financial mathematics and the M.S. degree in artificial intelligence and machine learning from the University of Limerick, Limerick, Ireland, in 2020 and 2021, respectively, where he is currently pursuing the Ph.D. degree in electronic and computer engineering. His research interests include artificial intelligence and computer vision.

CIARÁN EISING (Member, IEEE) received the B.E. degree in electronic and computer engineering and the Ph.D. degree from the National University of Ireland, Galway, in 2003 and 2010, respectively. From 2009 to 2020, he worked as a Computer Vision Team Lead and an Architect with Valeo Vision Systems, where he also held the title of Senior Expert. In 2016, he held the position of Adjunct Lecturer with the National University of Ireland, Galway. In 2020, he joined the University of Limerick, as a Lecturer of artificial intelligence and computer vision.

PATRICK DENNY (Member, IEEE) received the B.Sc. degree in experimental physics and mathematics from the National University of Ireland (NUI), Maynooth, Ireland, in 1993, and the M.Sc. degree in mathematics and the Ph.D. degree in physics from the University of Galway, Ireland, in 1994 and 2000, respectively. He was with GFZ Potsdam, Germany. From 1999 to 2001, he was an RF Engineer with AVM GmbH, Germany, developing the RF hardware for the first integrated GSM/ISDN/USB modem. After working in supercomputing with Compaq-HP, from 2001 to 2002, he joined Connaught Electronics Ltd. (later Valeo), Galway, Ireland, as a Team Leader of RF design. For more than 20 years, he worked as a Lead Engineer, developing novel RF and imaging systems, and led the development of the first mass-production HDR automotive cameras for leading car companies, including Jaguar Land Rover, BMW, and Daimler. In 2010, he became an Adjunct Professor of engineering and informatics with the University of Galway and a Lecturer of artificial intelligence with the Department of Electronic and Computer Engineering, University of Limerick, Ireland, in 2022. He is a Co-Founder and a Committee Member of the IEEE P2020 Automotive Imaging Standards Group, the AutoSens Conference on Automotive Imaging, and the IS&T Electronic Imaging Autonomous Vehicles and Machines (AVM) Conference.

ANTHONY SCANLAN received the B.Sc. degree in experimental physics from the National University of Ireland, Galway, Galway, Ireland, in 1998, and the M.Eng. and Ph.D. degrees in electronic engineering from the University of Limerick, Limerick, Ireland, in 2001 and 2005, respectively. He is currently a Senior Research Fellow with the Department of Electronic and Computer Engineering, University of Limerick, and has been a principal investigator for several research projects in the areas of signal processing and data converter design. His current research interests include artificial intelligence, computer vision, and their industrial and environmental applications.

PEPIJN VAN DE VEN received the M.Sc. degree in electronic engineering from the Eindhoven University of Technology, The Netherlands, in 2000, and the Ph.D. degree in artificial intelligence for autonomous underwater vehicles from the University of Limerick (UL), in 2005. In 2018, he joined UL's teaching staff, as a Senior Lecturer in artificial intelligence. His research interests include artificial intelligence and machine learning, with a particular interest in medical applications.

EOIN M. GRUA was born in Cork, Ireland, in 1993. He received the B.S. degree in liberal arts and sciences from Amsterdam University College, Amsterdam, The Netherlands, in 2015, the M.S. degree in computer science from Swansea University, Swansea, Wales, in 2016, and the Ph.D. degree in computer science from Vrije Universiteit Amsterdam, Amsterdam, in 2021. In 2021, he was a Research Assistant with the University of Limerick, Limerick, Ireland, where he is currently a Postdoctoral Researcher with the Department of Electronic and Computer Engineering. His research interests include artificial intelligence, software engineering and architecture, and sustainability.