
An Optimized Hybrid Transformer-Based Technique for Real-Time Pedestrian Intention Estimation in Autonomous Vehicles

Mohamed Galloul, The American University in Cairo, Cairo, Egypt, [email protected]
Mariam Aboelwafa, NewGiza University, 6th of October City, Egypt, [email protected]
Yasser Gadallah, The American University in Cairo, Cairo, Egypt, [email protected]

Abstract—The proliferation of autonomous vehicles (AVs) research and technologies has the potential of advancing intelligent transportation systems to new horizons. One of the main issues that relate to AV operation is the need to enable these vehicles to comprehend the behavior of road users the same way as human drivers would normally do. For this purpose, significant research efforts have focused on the issue of enabling the AVs to predict the intention of pedestrians, especially as it relates to road-crossing. This is done by interpreting some attributes of the pedestrians' behaviors as they approach the areas where they can potentially attempt to cross the street. In this study, we introduce a novel architecture of a pedestrian intention prediction model. This architecture includes the processing of the collected data in two parallel paths. In these two paths, image and non-image data are processed via two types of transformer-based models and then fused into an intention prediction classifier. Experimental simulations show that the proposed technique produces significantly better pedestrian intention prediction results, as compared to leading models from the literature.

Index Terms—Autonomous Vehicles, Smart Systems, Pedestrian Intention Prediction, Transformers, Machine Learning

I. INTRODUCTION

Autonomous Vehicles (AVs) research activities have witnessed a large boost during the recent decades due to the significant advances in related hardware, e.g., cameras and sensors, and software, e.g., data analytical algorithms and deep learning techniques [1]. Nevertheless, to realize the full potential of AVs, their operation must ensure the safety of other road users and particularly pedestrians [2]. Hence, there has been an increased research focus on understanding the behavior of pedestrians, especially as it relates to road crossing.

Predicting the pedestrians' intention as to whether they will cross the road is not a trivial task. It is affected by many internal, i.e., individual, and external, i.e., contextual, factors [3]. Nevertheless, there are several indicators that can be obtained through the use of cameras and/or sensors and processed using suitable techniques to enable the AV to capture the pedestrians' features and estimate their crossing intention before the actual crossing occurs. Intention estimation should not only be highly accurate but also done early enough for the AV to act accordingly and prevent potential accidents.

The urban environment is considered a challenging scenario as it involves crowded intersections and rushing pedestrians [4]. Moreover, modelling the pedestrian's crossing intention depends on many factors that are not only numerous but also dynamic [5]. As a result, traditional models that rely on mathematical analysis will not fit in such a complex and dynamic scenario due to real-time constraints. Hence, deep learning is more convenient since it provides a stack of algorithms that can learn efficiently to extract features from frames and make highly accurate decisions accordingly in real time [6]. Fortunately, most of the features that give an indication of the pedestrian's crossing intention are visible. These features include body pose, curb approaching/avoidance, vehicle speed, etc. Hence, by capturing successive frames of a pedestrian, such indicators can be extracted and used to train the AV offline to estimate the pedestrian's crossing intention. Later, a well-trained AV should be able to process new frames of different pedestrians and make highly accurate decisions regarding their intentions in a real-time fashion.

In the literature, the use of deep learning for pedestrian intention estimation is progressing at a rapid pace and highly accurate results are emerging. Furthermore, there are several publicly available datasets that try to capture and label frames of pedestrians who are crossing and not crossing the road in urban situations [5], [7]. More details about the algorithms and datasets in the literature are presented next.

A. Related Studies

Pedestrian intention estimation, pedestrian action prediction, and pedestrian behavior anticipation are used interchangeably to refer to predicting pedestrians' intention, especially for road-crossing. Many studies use pedestrian-based features, e.g., pose [8], previous trajectory [9] and head orientation [10], to predict either the pedestrian's intention or future trajectory. Although relying on pedestrians' dynamics can yield good results, adding contextual information inferred from the scene improves performance and reduces overfitting [11].

Features like social interactions with nearby pedestrians in crowded scenes can determine the future action of a specific pedestrian [12].

Yet, different pedestrians in the crowd are assigned different attention weights based on the degree of impact they have on the target pedestrian [13]. Attention weights are the values that reflect the importance of different input features for a specific prediction. Furthermore, scene elements such as zebra crossings, traffic signs and intersections directly influence the pedestrians' actions [14]. Additionally, ego-vehicle information, e.g., its speed and distance from the pedestrian, has been used along with pedestrian-related and/or scene-related features in many studies to better anticipate pedestrians' intention [5], [7], [15].

Many data-driven models with different approaches have been presented in the literature for the purpose of predicting pedestrians' intention. Convolutional Neural Networks (CNNs) have been used in many studies as a feature extractor followed by a classifier [7], [10]. Nonetheless, a CNN with a fixed kernel size limits its receptive field, thus diminishing its ability to capture scene complexities [11]. Alternatively, Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory networks (LSTMs), can capture dependencies in long sequences to anticipate the pedestrians' intention [9]. They can further be coupled with attention mechanisms, which allow the model to focus on a subset of the input data when making predictions, rather than equally considering the entire input, so as to focus on important aspects of the input frame sequence [12].

The use of transformer-based models can help achieve strong levels of performance in many sequence modeling problems. In [16], a transformer-based graph convolution model is introduced to capture complex interactions between pedestrians to predict their future trajectories. In [17], the authors compare BERT (Bidirectional Encoder Representations from Transformers) with a Vanilla Transformer (VT) for predicting pedestrians' trajectories. While the VT can only use information from one direction of the sequence when computing the representation, BERT, on the other hand, processes the input sequence in both directions. This allows the model to use information from both the past and future context of the sequence when computing the representation, at a cost of computational complexity. The authors of [18] introduce the CAPformer (Crossing Action Prediction based on Transformers), a two-branched architecture that integrates both a VT and TimeSformer, where the TimeSformer is an approach that adapts the standard transformer architecture to video by enabling spatiotemporal feature learning [19], to predict the pedestrians' crossing intention.

B. Paper Contribution

In this paper, we address pedestrian intention prediction from a novel architectural perspective. The contribution of this work can be summarized in the following:

• A two-branched transformer-based model for pedestrians' intention estimation is introduced where
  – the first branch handles pedestrians' cropped images using a video masked autoencoder [20] to capture the motion information from frames, which is crucial for understanding actions in videos. This incorporates the motion information into the encoding process to better capture the semantic meaning of the actions.
  – the second branch handles non-image features using a VT encoder [21] to extract a meaningful representation of the input sequences that can capture the underlying patterns and dependencies in the data.
• Merging the learned representations from the two branches is done by concatenation and feeding them into an intention classifier to predict the final intention (crossing/not-crossing). The intention classifier is a two-layer feed-forward neural network with a Rectified Linear Unit (ReLU) activation and a dropout rate of 0.5.

The rest of the paper is organized as follows. The details of the problem under consideration, the system model and the presented approach are presented in Section II. The performance evaluation results are presented in Section III. Finally, the paper is concluded in Section IV.

II. THE PROPOSED APPROACH

In this section, we present a description of the problem at hand and the proposed algorithm. The system model and the presented framework are explained in detail. But first, we present a basic technical background of the technique on which we base our work.

A. Background

Fig. 1. General Vanilla Transformer Model

The transformer model was introduced in [21]. It is referred to as the Vanilla Transformer (VT). As shown in Fig. 1, the VT consists of an encoder and a decoder, both of which are composed of multiple layers that consist of a self-attention mechanism and a feedforward network [22]. Transformer-based models first appeared in Natural Language Processing (NLP). They have been shown to outperform other approaches such as recurrent neural networks (RNNs).
The transformer models are designed to process sequential data, such as natural language, more efficiently than RNNs, by using self-attention mechanisms, a type of attention mechanism that allows the model to dynamically weigh the importance of different elements in a sequence of inputs when making predictions, i.e., to calculate the relevance of each input element to each output element. It is worth mentioning that some transformer-based models use only the encoder part to generate a sequence of representations that capture the meaning and context of the input sequence, as in [18]–[20]. Our proposed model also uses this type of transformer.

Transformer-based models have been successfully used in action prediction tasks such as human pose estimation and action recognition in videos and images [18], [23]. In these tasks, the input is typically a sequence of images or video frames and the goal is to predict the actions being performed in this sequence. One approach to using a transformer-based model for these tasks is to treat each image or video frame as an input element and use the self-attention mechanisms in the model to weigh the importance of the different frames in the input sequence [24]. Another approach is to process the features extracted from the images or video frames, rather than the raw pixel data [25].

B. System Model

Fig. 2. The overall structure of the model. The upper branch deals with the image sequence, while the lower branch feeds non-image features to the VT encoder. The learned representations from both branches are then fused and passed to an FFNN to estimate crossing intentions.

In this subsection, we explain the components of the proposed model that is shown in Fig. 2.

1) The Input Features: We can divide the features we use into mainly two categories, namely, image and non-image features. The cropped-images sequence of the pedestrian is fed into the video masked autoencoder branch to produce a learned representation of the sequence (h_state). In the parallel branch, the bounding-box coordinates of the pedestrian along with the ego-vehicle speed are passed to the VT encoder to be embedded and represented as a vector.

2) The Video Masked AutoEncoder Branch: In this branch, we use the model that was built on top of the base version of the Vision Transformer (ViT) [26], as proposed in [20]. It is a self-supervised encoder-decoder-based model. During the pre-training phase, the encoder learns how to represent the input image sequence while capturing the important spatio-temporal features after masking around 90% of the input images' content. Then, the decoder tries to reconstruct the input images from the latent space representation, and a reconstruction loss is applied to capture the difference between the input and the reconstructed images, which helps the encoder to learn better representations.

This branch is pre-trained on the Kinetics-400 dataset [27]. Then, we use its encoder for fine-tuning using the pedestrian's image sequence.
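For illustration, the sketch below shows how a Kinetics-400 pre-trained video masked autoencoder encoder can be loaded and applied to a 16-frame cropped-image sequence. It assumes the Hugging Face transformers implementation and the MCG-NJU/videomae-base checkpoint, neither of which is specified in the paper, so it should be read as a sketch rather than the authors' exact pipeline.

```python
# Sketch: obtaining the image-branch representation (h_state) with a
# Kinetics-400 pre-trained VideoMAE encoder. Checkpoint name and library
# (Hugging Face transformers) are assumptions, not stated in the paper.
import torch
from transformers import VideoMAEModel

encoder = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

# One pedestrian sample: 16 cropped frames of size 224x224, already
# normalized with the ImageNet mean/std as described in Section II-C.
frames = torch.randn(1, 16, 3, 224, 224)  # (batch, frames, channels, H, W)

with torch.no_grad():
    outputs = encoder(pixel_values=frames)

# Pool the token embeddings into a single 768-dimensional vector (h_state).
h_state = outputs.last_hidden_state.mean(dim=1)
print(h_state.shape)  # torch.Size([1, 768])
```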
3) The Vanilla Transformer Encoder (VTE) Branch: We use the same VT model as proposed in [21]. Yet, similar to the first branch, we only use the encoder part of the VT. The decoder is mainly used for generative tasks, e.g., machine translation, question-answering, etc., but we only need the VTE to represent the non-image features, i.e., the bounding box coordinates and the ego-vehicle speed.

4) Feature Fusion and Intention Classification: Finally, our proposed model combines the learned representations from both branches through concatenation and then passes the result to the intention classifier to predict the probability of intention for both classes, namely, the crossing and not-crossing classes. The intention classifier is a simple 2-layer feed-forward neural network (FFNN) with a dropout of 0.5 and a ReLU activation in between.
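A minimal PyTorch sketch of the non-image branch and the fusion/classification stage is given below, using the dimensions later listed in TABLE I (d_model = 256 for the VT encoder, h_state = 768 for the image branch). The layer names, the pooling step, and the classifier hidden width are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class NonImageBranch(nn.Module):
    """Vanilla Transformer encoder over per-frame non-image features
    (4 bounding-box coordinates + 1 ego-vehicle speed per frame)."""
    def __init__(self, feat_dim=5, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=384, dropout=0.1,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                # x: (batch, 16, 5)
        z = self.encoder(self.embed(x))  # (batch, 16, 256)
        return z.mean(dim=1)             # pooled sequence representation

class IntentionClassifier(nn.Module):
    """Two-layer FFNN with ReLU and dropout 0.5, fed the concatenated
    representations of the two branches."""
    def __init__(self, image_dim=768, nonimage_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + nonimage_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden, 2),        # crossing / not-crossing logits
        )

    def forward(self, h_image, h_nonimage):
        return self.net(torch.cat([h_image, h_nonimage], dim=-1))
```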
C. Training

In order to train the proposed model, we need a large dataset of video frames that capture the pedestrian's behavior in various scenarios and the corresponding non-image features (labels) such as the bounding-box coordinates and the ego-vehicle speed. Using such a dataset, the model is trained end-to-end to minimize a loss function that represents the difference between the model's predicted output, i.e., the estimated pedestrian's intention, and the true output, i.e., the ground truth intentions. The dataset and the loss function are discussed next.

1) Dataset: We train the model using the Pedestrian Intention Estimation (PIE) dataset [5]. This dataset includes 1842 pedestrians, 1322 of which did not have the intention of crossing while the rest did. We also followed the same set split for the training, validation, and test purposes, as proposed in [15]. The tracking length for each pedestrian is 16 frames, whereas the last frame is 1-2 sec (30-60 frames) prior to the crossing/not-crossing action.

2) Data Preprocessing: Since we are using multiple features with different ranges, data normalization is a necessity. We apply z-standardization on the cropped images using the mean and standard deviation of ImageNet, which is a large-scale image dataset widely used in computer vision applications [28], and also on the ego-vehicle speed using its mean and standard deviation from the training set.

The ego-vehicle speed is standardized as

$$\upsilon_{standardized} = \frac{\upsilon - \mu_{speed}}{\sigma_{speed}} \quad (1)$$

The bounding-box coordinates are normalized as follows:

$$x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (2)$$

$$y_{normalized} = \frac{y - y_{min}}{y_{max} - y_{min}} \quad (3)$$

where $x_{min} = y_{min} = 0$, $x_{max} = 1920$, and $y_{max} = 1080$.
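A short sketch of this preprocessing, assuming PyTorch tensors and the commonly used ImageNet channel statistics; the helper names are illustrative, not taken from the paper.

```python
import torch

# ImageNet channel statistics commonly used for z-standardization [28].
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 1, 3, 1, 1)
IMAGENET_STD  = torch.tensor([0.229, 0.224, 0.225]).view(1, 1, 3, 1, 1)

def standardize_images(frames):
    """frames: (batch, 16, 3, 224, 224) cropped images scaled to [0, 1]."""
    return (frames - IMAGENET_MEAN) / IMAGENET_STD

def standardize_speed(speed, mu_speed, sigma_speed):
    """Equation (1): z-standardize ego-vehicle speed with training-set stats."""
    return (speed - mu_speed) / sigma_speed

def normalize_bbox(bbox, width=1920.0, height=1080.0):
    """Equations (2)-(3): min-max normalize (x1, y1, x2, y2) coordinates."""
    scale = torch.tensor([width, height, width, height])
    return bbox / scale
```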
3) The Loss Function: The conventional cross-entropy loss ($L_{CE}$) is used for our task to measure the difference between the predicted probability distribution of the model and the true probability distribution of the labels. Generally, the expression of $L_{CE}$ can be formulated as follows:

$$L_{CE}(\hat{y}, y) = -\sum_{i}^{n} y_i \log(p_i) \quad (4)$$

$$p_i = \frac{\exp(x_i)}{\sum_{j}^{n} \exp(x_j)} \quad (5)$$

where $n$ is the number of classes, $y_i$ is the ground truth label, $p_i$ is the softmax probability of the $i$th class, and $x$ is the logits output vector of the model.

In the problem under consideration, we need to account for the imbalance between the two classes in the dataset. For this purpose, we use a weighted cross-entropy loss function ($L_{Weighted\text{-}CE}(\hat{y}, y)$) instead of the general cross-entropy loss expression. In $L_{Weighted\text{-}CE}(\hat{y}, y)$, we assign different weights ($W_C$ for the crossing class (C), and $W_{NC}$ for the non-crossing class (NC)) to make the less-represented class contribute equally, during the learning process, to the loss function as the major class. As a result, the weighted cross-entropy loss function is formulated as:

$$L_{Weighted\text{-}CE}(\hat{y}, y) = W_C L_C(\hat{y}, y_C) + W_{NC} L_{NC}(\hat{y}, y_{NC}) \quad (6)$$

where
$L_C$ denotes the loss when the target class is crossing,
$L_{NC}$ denotes the loss when the target class is not-crossing,
$y_C$ is the crossing ground truth,
$y_{NC}$ is the not-crossing ground truth,
$W_C = \frac{N_{NC}}{N_C + N_{NC}}$,
$W_{NC} = \frac{N_C}{N_C + N_{NC}}$,
$N_C$ denotes the number of samples with a crossing label, and
$N_{NC}$ denotes the number of samples with a not-crossing label.
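These weights can be plugged directly into a standard weighted cross-entropy loss. The sketch below uses the PIE class counts quoted above; the class ordering is an assumption.

```python
import torch
import torch.nn as nn

# Class counts from the PIE dataset split described above.
N_NC, N_C = 1322.0, 520.0           # not-crossing / crossing samples

# Equation (6) weights: the minority (crossing) class gets the larger weight.
W_C  = N_NC / (N_C + N_NC)
W_NC = N_C  / (N_C + N_NC)

# Class order assumed as [not-crossing, crossing].
criterion = nn.CrossEntropyLoss(weight=torch.tensor([W_NC, W_C]))

logits = torch.randn(14, 2)          # batch of 14 samples, 2 classes
labels = torch.randint(0, 2, (14,))  # ground-truth intentions
loss = criterion(logits, labels)
```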
4) Metrics: We report our performance evaluation results using the $f1_{score}$ and the area under the Receiver Operating Characteristics (ROC) curve ($AUC$). These two metrics suit imbalanced data best since reporting the accuracy in such cases can be misleading [29]. The f1-score is defined as

$$f1_{score} = 2 \times \frac{Precision \times Recall}{Precision + Recall} \quad (7)$$

where $Precision$ is defined as the ratio of the number of true positives to the sum of true positives and false positives. $Precision$ represents the ability of the classifier to avoid false positive predictions. $Recall$ is defined as the ratio of the number of true positives to the sum of true positives and false negatives. $Recall$ represents the ability of the classifier to find all positive instances.

The ROC curve is a plot of the false positive rate versus the true positive rate at multiple thresholds. The AUC, which is the area under the ROC curve, is a good indicator of a classifier's performance. The higher the AUC of a classifier, the more reliable it is.
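Both metrics can be computed with scikit-learn, as sketched below, assuming the predicted probability of the crossing class is used as the score for the AUC.

```python
from sklearn.metrics import f1_score, roc_auc_score

# y_true: ground-truth intentions (1 = crossing, 0 = not-crossing)
# y_prob: predicted probability of the crossing class for each sample
y_true = [1, 0, 0, 1, 1, 0]
y_prob = [0.91, 0.22, 0.40, 0.65, 0.78, 0.05]
y_pred = [int(p >= 0.5) for p in y_prob]

print("F1-score:", f1_score(y_true, y_pred))  # equation (7)
print("AUC:", roc_auc_score(y_true, y_prob))  # area under the ROC curve
```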
The detailed procedure of the proposed approach is illustrated in Algorithm 1.

III. PERFORMANCE EVALUATION

In this section, we present an evaluation of the performance of the proposed approach by analyzing various metrics and comparing them against state-of-the-art results. As mentioned earlier, we adopt the $f1_{score}$ and the area under the ROC curve ($AUC$). The state-of-the-art results that we consider in the comparison are the CAPformer [18] and the PCPA (Pedestrian Crossing Prediction with Attention) [15]. Both approaches report the highest $f1_{score}$ and $AUC$ on the same dataset [5].

For the setup of our simulation environment, we tried multiple hyper-parameters to achieve the best possible performance. We list the hyper-parameters that we used to get the best performance of our model in TABLE I.
TABLE I
SIMULATION PARAMETERS

Vanilla Transformer Encoder
  Embedding dimension (d_model): 256
  Number of encoder layers: 2
  Number of heads: 4
  Dropout rate: 0.1
  MLP hidden layer dimension: 384

Video Masked Autoencoder
  Last hidden state size (h_state): 768
  Input image size: 224 x 224
  Pretrained weights: Kinetics-400

Fusion Parameters
  Dropout rate: 0.5
  Activation fn.: ReLU

General Parameters
  Sequence length (N): 16
  Learning rate: 1e-3
  Learning rate scheduler: Cosine Decay
  Batch size (B): 14
  Epochs: 20
  Cropping strategy: Local box warp [18]
  Optimizer: AdamW
  Weight decay: 0.05

Algorithm 1 Transformer-based Pedestrian Intention Estimation
1: procedure DATA PREPROCESSING
2:   Input: Raw Data
3:   Crop pedestrian images into (224 x 224) images using bounding-box coordinates.
4:   Standardize ego-vehicle speed using z-standardization.
5:   Normalize both bounding-box coordinates and cropped images using min-max normalization.
6: end procedure
7: procedure DATA SPLIT
8:   Input: Processed Data.
9:   Set batchSize.
10:  Define trainSet: training dataset.
11:  Define validationSet: validation dataset.
12:  Define testSet: testing dataset.
13:  Set dataSplits to [trainSet, validationSet, testSet].
14:  for dataSplit in dataSplits do
15:    Create a dataloader for dataSplit with a batch size of batchSize.
16:  end for
17: end procedure
18: procedure TRAINING
19:  Input: trainLoader and validationLoader
20:  Set hyperparameters.
21:  for each training epoch do
22:    Set predictions to the output of our hybrid model.
23:    Calculate train metrics, namely, weighted CE-loss, accuracy, precision, recall, and f1-score.
24:    Update the model weights with the optimizer step.
25:  end for
26:  Calculate validation metrics.
27:  if validation loss < previous best model loss then
28:    Set best model to the current model.
29:  else
30:    Change hyperparameters.
31:    Repeat Training.
32:  end if
33:  Export best model weights for future inference.
34: end procedure
35: procedure TESTING
36:  Input: testLoader
37:  Load best model.
38:  Set test predictions to the output of the best model.
39:  Calculate test metrics.
40:  Output: test metrics
41: end procedure
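A condensed PyTorch-style sketch of the TRAINING procedure in Algorithm 1 is shown below, using the hyper-parameters of TABLE I. The model, data loaders, and loss function are placeholders for the components described in Section II, so this is a sketch of the loop structure rather than the authors' code.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, val_loader, criterion, epochs=20, lr=1e-3,
          weight_decay=0.05, device="cuda"):
    """Sketch of Algorithm 1 (TRAINING): weighted-CE training with AdamW,
    cosine decay, and best-model selection on validation loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    best_val_loss, best_state = float("inf"), None

    for epoch in range(epochs):
        model.train()
        for frames, feats, labels in train_loader:       # image / non-image inputs
            logits = model(frames.to(device), feats.to(device))
            loss = criterion(logits, labels.to(device))  # weighted CE, eq. (6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for frames, feats, labels in val_loader:
                logits = model(frames.to(device), feats.to(device))
                val_loss += criterion(logits, labels.to(device)).item()
        if val_loss < best_val_loss:                      # keep the best model
            best_val_loss, best_state = val_loss, model.state_dict()

    return best_state
```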
We report the results of the performance evaluation of our technique based on the selected aforementioned metrics on the test set of the PIE dataset in comparison with the other selected baseline models in the literature. TABLE II shows the results.
TABLE II
RESULTS OF DIFFERENT MODELS ON THE TEST SET. I DENOTES THE PEDESTRIAN CROPPED IMAGE SEQUENCE, P DENOTES THE PEDESTRIAN'S POSE, BB DENOTES THE PEDESTRIAN'S BOUNDING BOX AND S DENOTES THE EGO-VEHICLE SPEED.

Model            Backbone                        Features      Params   F1-score   AUC
PCPA [15]        C3D                             I, P, BB, S   31M      0.770      0.86
CAPformer [18]   TimeSformer [19]                I, BB, S      123M     0.779      0.853
CAPformer [18]   RubiksNet                       I, BB, S      8M       0.749      0.839
Our model        Video Masked AutoEncoder [20]   I, BB, S      89M      0.843      0.914

As illustrated in TABLE II, our proposed model outperforms both the CAPformer and the PCPA models. This is due to its ability to capture more complex spatio-temporal patterns in video data. The encoder of the upper branch was able to learn the important features of the sequence. This allowed our model to learn representations that are more compact and informative compared to other models.

Notably, our model is able to achieve this higher performance with a moderate-size GPU (RTX 2080 Ti 11GB, compared to the A100 24GB GPU used in [18]). This was achieved by using the 8-bit version of AdamW rather than the conventional version of AdamW, which can save up to 75% of the GPU memory utilized by the optimizer.
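One common way to obtain an 8-bit AdamW optimizer is the bitsandbytes library; the paper does not name the implementation it used, so the snippet below is only an assumed illustration.

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(1024, 2)  # placeholder for the hybrid model of Section II

# Drop-in replacement for torch.optim.AdamW that stores optimizer state in
# 8-bit, cutting the GPU memory consumed by the optimizer.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-3, weight_decay=0.05)
```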
Our proposed model can, therefore, pave the way for future transformer-based models for similar scenarios with moderate hardware capabilities while still benefiting from the transformer's parallelization and long-dependencies attention.

IV. CONCLUSION

The area of autonomous vehicles research is currently thriving due to the large advances in the areas of hardware, such as those related to sensing and video recording, and software, such as artificial intelligence techniques. There is a crucial need to enable the AVs, as machines, to accurately comprehend the behavior of road users. One of the major requirements in this regard is pedestrian intention prediction, especially as it relates to road-crossing. In this study, we introduced a novel intention prediction model architecture that enables the AV to predict the intention of pedestrians as to whether they will cross the street. The model is based on fusing the transformer-processed data, namely, the non-image data streams with the image data, into a classifier that then produces the required pedestrian intention prediction decisions. This is done while optimizing the utilized computation resources, thus avoiding the need to use highly sophisticated computing resources to reach proper conclusions. Experimental results show that the proposed technique produces significantly better results than those of leading models from the literature. These results pave the way towards expanding this architecture to include other formations of the input data.

REFERENCES

[1] Mahir Gulzar, Yar Muhammad, and Naveed Muhammad, "A survey on motion prediction of pedestrians and vehicles for autonomous driving," IEEE Access, 2021.
[2] Khaled Saleh, "Pedestrian trajectory prediction for real-time autonomous systems via context-augmented transformer networks," Sensors, vol. 22, no. 19, pp. 7495, 2022.
[3] Sirin Haddad, Meiqing Wu, He Wei, and Siew Kei Lam, "Situation-aware pedestrian trajectory prediction with spatio-temporal attention model," arXiv preprint arXiv:1902.05437, 2019.
[4] Suresh Kumaar Jayaraman, Lionel P. Robert, X. Jessie Yang, and Dawn M. Tilbury, "Multimodal hybrid pedestrian: A hybrid automaton model of urban pedestrian behavior for automated driving applications," IEEE Access, vol. 9, pp. 27708–27722, 2021.
[5] Amir Rasouli, Iuliia Kotseruba, Toni Kunic, and John K. Tsotsos, "PIE: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6262–6271.
[6] Ajay Shrestha and Ausif Mahmood, "Review of deep learning algorithms and architectures," IEEE Access, vol. 7, pp. 53040–53065, 2019.
[7] Amir Rasouli, Iuliia Kotseruba, and John K. Tsotsos, "Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 206–213.
[8] Raúl Quintero Mínguez, Ignacio Parra Alonso, David Fernández-Llorca, and Miguel Ángel Sotelo, "Pedestrian path, pose, and intention prediction through Gaussian process dynamical models and pedestrian activity recognition," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 5, pp. 1803–1814, 2019.
[9] Khaled Saleh, Mohammed Hossny, and Saeid Nahavandi, "Intent prediction of pedestrians via motion trajectories using stacked recurrent neural networks," IEEE Transactions on Intelligent Vehicles, vol. 3, no. 4, pp. 414–424, 2018.
[10] Dimitrios Varytimidis, Fernando Alonso-Fernandez, Boris Duran, and Cristofer Englund, "Action and intention recognition of pedestrians in urban traffic," in 2018 14th International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), 2018, pp. 676–682.
[11] Neha Sharma, Chhavi Dhiman, and S. Indu, "Pedestrian intention prediction for autonomous vehicles: A comprehensive survey," Neurocomputing, vol. 508, pp. 120–152, Oct. 2022.
[12] Tharindu Fernando, Simon Denman, Sridha Sridharan, and Clinton Fookes, "Soft + Hardwired attention: An LSTM framework for human trajectory prediction and abnormal event detection," Neural Networks, vol. 108, pp. 466–478, Dec. 2018.
[13] Yi Fang, Yize Li, Asam Ahmed, and Siming You, "Development, economics and global warming potential of lignocellulose biorefinery," in Biomass, Biofuels, Biochemicals, pp. 1–13. Elsevier, Waltham, MA, USA, Jan. 2021.
[14] Amir Rasouli, Iuliia Kotseruba, and John K. Tsotsos, "Understanding pedestrian behavior in complex traffic scenes," IEEE Transactions on Intelligent Vehicles, vol. 3, no. 1, pp. 61–70, 2018.
[15] Iuliia Kotseruba, Amir Rasouli, and John K. Tsotsos, "Benchmark for evaluating pedestrian action prediction," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1258–1268.
[16] Cunjun Yu, Xiao Ma, Jiawei Ren, Haiyu Zhao, and Shuai Yi, "Spatio-temporal graph transformer networks for pedestrian trajectory prediction," in European Conference on Computer Vision. Springer, 2020, pp. 507–523.
[17] Francesco Giuliari, Irtiza Hasan, Marco Cristani, and Fabio Galasso, "Transformer networks for trajectory forecasting," arXiv, Mar. 2020.
[18] Javier Lorenzo, Ignacio Parra Alonso, Rubén Izquierdo, Augusto Luis Ballardini, Álvaro Hernández Saz, David Fernández Llorca, and Miguel Ángel Sotelo, "CAPformer: Pedestrian crossing action prediction using transformer," Sensors, vol. 21, no. 17, pp. 5694, 2021.
[19] Gedas Bertasius, Heng Wang, and Lorenzo Torresani, "Is space-time attention all you need for video understanding?," arXiv, Feb. 2021.
[20] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang, "VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training," arXiv preprint arXiv:2203.12602, 2022.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[22] Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, and Xinggang Wang, "Unleashing vanilla vision transformer with masked image modeling for object detection," arXiv preprint arXiv:2204.02964, 2022.
[23] Lina Achaji, Julien Moreau, Thibault Fouqueray, Francois Aioun, and François Charpillet, "Is attention to bounding boxes all you need for pedestrian action prediction?," in 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2022, pp. 895–902.
[24] Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, and Zhiqiang He, "A survey of visual transformers," arXiv preprint arXiv:2111.06091, 2021.
[25] J. Lorenzo, I. Parra, and M. A. Sotelo, "IntFormer: Predicting pedestrian intention with the aid of the transformer architecture," arXiv preprint arXiv:2105.08647, 2021.
[26] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv, Oct. 2020.
[27] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[28] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[29] László A. Jeni, Jeffrey F. Cohn, and Fernando De La Torre, "Facing imbalanced data–recommendations for the use of performance metrics," in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 2013, pp. 245–251.
