
An Optimized Hybrid Transformer-Based Technique for Real-Time Pedestrian Intention Estimation in Autonomous Vehicles

Mohamed Galloul, The American University in Cairo, Cairo, Egypt, [email protected]
Mariam Aboelwafa, NewGiza University, 6th of October City, Egypt, [email protected]
Yasser Gadallah, The American University in Cairo, Cairo, Egypt, [email protected]

Abstract—The proliferation of autonomous vehicles (AVs) research and technologies has the potential of advancing intelligent transportation systems to new horizons. One of the main issues that relate to AV operation is the need to enable these vehicles to comprehend the behavior of road users the same way as human drivers would normally do. For this purpose, significant research efforts have focused on the issue of enabling the AVs to predict the intention of pedestrians, especially as it relates to road-crossing. This is done by interpreting some attributes of the pedestrians' behaviors as they approach the areas where they can potentially attempt to cross the street. In this study, we introduce a novel architecture of a pedestrian intention prediction model. This architecture includes the processing of the collected data in two parallel paths. In these two paths, image and non-image data are processed via two types of transformer-based models and then fused into an intention prediction classifier. Experimental simulations show that the proposed technique produces significantly better pedestrian intention prediction results, as compared to leading models from the literature.

Index Terms—Autonomous Vehicles, Smart Systems, Pedestrian Intention Prediction, Transformers, Machine Learning

I. INTRODUCTION

Autonomous Vehicles (AVs) research activities have witnessed a large boost during the recent decades due to the significant advances in related hardware, e.g., cameras and sensors, and software, e.g., data analytical algorithms and deep learning techniques [1]. Nevertheless, to realize the full potential of AVs, their operation must ensure the safety of other road users and particularly pedestrians [2]. Hence, there has been an increased research focus on understanding the behavior of pedestrians, especially as it relates to road crossing.

Predicting the pedestrians' intention as to whether they will cross the road is not a trivial task. It is affected by many internal, i.e., individual, and external, i.e., contextual, factors [3]. Nevertheless, there are several indicators that can be obtained through the use of cameras and/or sensors and processed using suitable techniques to enable the AV to capture the pedestrians' features and estimate their crossing intention before the actual crossing occurs. Intention estimation should not only be highly accurate but also done early enough for the AV to act accordingly and prevent potential accidents.

The urban environment is considered a challenging scenario as it involves crowded intersections and rushing pedestrians [4]. Moreover, modelling the pedestrian's crossing intention depends on many factors that are not only numerous but also dynamic [5]. As a result, traditional models that rely on mathematical analysis will not fit in such a complex and dynamic scenario due to real-time constraints. Hence, deep learning is more convenient since it provides a stack of algorithms that can learn efficiently to extract features from frames and make highly accurate decisions accordingly in real time [6]. Fortunately, most of the features that give an indication of the pedestrian's crossing intention are visible. These features include body pose, curb approaching/avoidance, vehicle speed, etc. Hence, by capturing successive frames of a pedestrian, such indicators can be extracted and used to train the AV offline to estimate the pedestrian's crossing intention. Later, a well-trained AV should be able to process new frames of different pedestrians and make highly accurate decisions regarding their intentions in a real-time fashion.

In the literature, the use of deep learning for pedestrian intention estimation is progressing at a rapid pace and highly accurate results are emerging. Furthermore, there are several publicly available datasets that try to capture and label frames of pedestrians who are crossing and not crossing the road in urban situations [5], [7]. More details about the algorithms and datasets in the literature are presented next.

A. Related Studies

Pedestrian intention estimation, pedestrian action prediction, and pedestrian behavior anticipation are used interchangeably to refer to predicting pedestrians' intention, especially for road-crossing. Many studies use pedestrian-based features, e.g., pose [8], previous trajectory [9] and head orientation [10], to predict either the pedestrian's intention or future trajectory. Although relying on pedestrians' dynamics can yield good results, adding contextual information inferred from the scene improves performance and reduces overfitting [11].

Features like social interactions with nearby pedestrians in crowded scenes can determine the future action of a specific pedestrian [12].

Yet, different pedestrians in the crowd are assigned different attention weights based on the degree of impact they have on the target pedestrian [13]. Attention weights are the values that reflect the importance of different input features for a specific prediction. Furthermore, scene elements such as zebra crossings, traffic signs and intersections directly influence the pedestrians' actions [14]. Additionally, ego-vehicle information, e.g., its speed and distance from the pedestrian, has been used along with pedestrian-related and/or scene-related features in many studies to better anticipate pedestrians' intention [5], [7], [15].

Many data-driven models with different approaches have been presented in the literature for the purpose of predicting pedestrians' intention. Convolutional Neural Networks (CNNs) have been used in many studies as a feature extractor followed by a classifier [7], [10]. Nonetheless, a CNN with a fixed kernel size limits its receptive field, thus diminishing its ability to capture scene complexities [11]. Alternatively, Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory networks (LSTMs), can capture dependencies in long sequences to anticipate the pedestrians' intention [9]. They can further be coupled with attention mechanisms, which allow the model to focus on a subset of the input data when making predictions, rather than equally considering the entire input, so as to focus on important aspects of the input frame sequence [12].

The use of transformer-based models can help achieve strong levels of performance in many sequence modeling problems. In [16], a transformer-based graph convolution model is introduced to capture complex interactions between pedestrians to predict their future trajectories. In [17], the authors compare BERT (Bidirectional Encoder Representations from Transformers) with a Vanilla Transformer (VT) for predicting pedestrians' trajectories. While the VT can only use information from one direction of the sequence when computing the representation, BERT, on the other hand, processes the input sequence in both directions. This allows the model to use information from both the past and future context of the sequence when computing the representation, at a cost of computational complexity. The authors of [18] introduce the CAPformer (Crossing Action Prediction based on Transformers), a two-branched architecture that integrates both a VT and TimeSformer, where the TimeSformer is an approach that adapts the standard transformer architecture to video by enabling spatiotemporal feature learning [19], to predict the pedestrians' crossing intention.

B. Paper Contribution

In this paper, we address pedestrian intention prediction from a novel architectural perspective. The contribution of this work can be summarized in the following:

• A two-branched transformer-based model for pedestrians' intention estimation is introduced where
  – the first branch handles pedestrians' cropped images using a video masked autoencoder [20] to capture the motion information from frames, which is crucial for understanding actions in videos. This incorporates the motion information into the encoding process to better capture the semantic meaning of the actions.
  – the second branch handles non-image features using a VT encoder [21] to extract a meaningful representation of the input sequences that can capture the underlying patterns and dependencies in the data.
• Merging the learned representations from the two branches is done by concatenation and feeding them into an intention classifier to predict the final intention (crossing/not-crossing). The intention classifier is a two-layer feed-forward neural network with a Rectified Linear Unit (ReLU) activation and a dropout rate of 0.5.

The rest of the paper is organized as follows. The details of the problem under consideration, the system model and the presented approach are presented in Section II. The performance evaluation results are presented in Section III. Finally, the paper is concluded in Section IV.

II. THE PROPOSED APPROACH

In this section, we present a description of the problem at hand and the proposed algorithm. The system model and the presented framework are explained in detail. But first, we present a basic technical background of the technique on which we base our work.

A. Background

Fig. 1. General Vanilla Transformer Model

The transformer model was introduced in [21]. It is referred to as the Vanilla Transformer (VT). As shown in Fig. 1, the VT consists of an encoder and a decoder, both of which are composed of multiple layers that consist of a self-attention mechanism and a feedforward network [22]. Transformer-based models first appeared in Natural Language Processing (NLP). They have been shown to outperform other approaches such as recurrent neural networks (RNNs).
The transformer models are designed to process sequential data, such as natural language, more efficiently than RNNs, by using self-attention mechanisms, a type of attention mechanism that allows the model to dynamically weigh the importance of different elements in a sequence of inputs when making predictions, i.e., to calculate the relevance of each input element to each output element. It is worth mentioning that some transformer-based models use only the encoder part to generate a sequence of representations that capture the meaning and context of the input sequence, as in [18]–[20]. Our proposed model also uses this type of transformer.

Transformer-based models have been successfully used in action prediction tasks such as human pose estimation and action recognition in videos and images [18], [23]. In these tasks, the input is typically a sequence of images or video frames and the goal is to predict the actions being performed in this sequence. One approach to using a transformer-based model for these tasks is to treat each image or video frame as an input element and use the self-attention mechanisms in the model to weigh the importance of the different frames in the input sequence [24]. Another approach is to process the features extracted from the images or video frames, rather than the raw pixel data [25].

B. System Model

Fig. 2. The overall structure of the model. The upper branch deals with the image sequence, while the lower branch feeds non-image features to the VT encoder. The learned representations from both branches are then fused and passed to an FFNN to estimate crossing intentions.

In this subsection, we explain the components of the proposed model that is shown in Fig. 2.

1) The Input Features: We can divide the features we use into mainly two categories, namely, image and non-image features. The cropped-images sequence of the pedestrian is fed into the video masked autoencoder branch to produce a learned representation of the sequence (h_state). In the parallel branch, the bounding-box coordinates of the pedestrian along with the ego-vehicle speed are passed to the VT encoder to be embedded and represented as a vector.

2) The Video Masked AutoEncoder Branch: In this branch, we use the model that was built on top of the base version of the Vision Transformer (ViT) [26], as proposed in [20]. It is a self-supervised encoder-decoder-based model. During the pre-training phase, the encoder learns how to represent the input image sequence while capturing the important spatio-temporal features after masking around 90% of the input images' content. Then, the decoder tries to reconstruct the input images from the latent space representation, and a reconstruction loss is applied to capture the difference between the input and the reconstructed images, which helps the encoder to learn better representations.

This branch is pre-trained on the Kinetics-400 dataset [27]. Then, we use its encoder for fine-tuning using the pedestrian's image sequence.
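For illustration, the sketch below shows how a Kinetics-400 pre-trained video masked autoencoder encoder can be loaded and applied to a 16-frame cropped-image sequence. It assumes the Hugging Face transformers implementation and the MCG-NJU/videomae-base checkpoint, neither of which is specified in the paper, so it should be read as a sketch rather than the authors' exact pipeline.

```python
# Sketch: obtaining the image-branch representation (h_state) with a
# Kinetics-400 pre-trained VideoMAE encoder. Checkpoint name and library
# (Hugging Face transformers) are assumptions, not stated in the paper.
import torch
from transformers import VideoMAEModel

encoder = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

# One pedestrian sample: 16 cropped frames of size 224x224, already
# normalized with the ImageNet mean/std as described in Section II-C.
frames = torch.randn(1, 16, 3, 224, 224)  # (batch, frames, channels, H, W)

with torch.no_grad():
    outputs = encoder(pixel_values=frames)

# Pool the token embeddings into a single 768-dimensional vector (h_state).
h_state = outputs.last_hidden_state.mean(dim=1)
print(h_state.shape)  # torch.Size([1, 768])
```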
3) The Vanilla Transformer Encoder (VTE) Branch: We use the same VT model as proposed in [21]. Yet, similar to the first branch, we only use the encoder part of the VT. The decoder is mainly used for generative tasks, e.g., machine translation, question-answering, etc., but we only need the VTE to represent the non-image features, i.e., the bounding box coordinates and the ego-vehicle speed.

4) Feature Fusion and Intention Classification: Finally, our proposed model combines the learned representations from both branches through concatenation and then passes the result to the intention classifier to predict the probability of intention for both classes, namely, the crossing and not-crossing classes. The intention classifier is a simple 2-layer feed-forward neural network (FFNN) with a dropout of 0.5 and a ReLU activation in between.
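A minimal PyTorch sketch of the non-image branch and the fusion/classification stage is given below, using the dimensions later listed in TABLE I (d_model = 256 for the VT encoder, h_state = 768 for the image branch). The layer names, the pooling step, and the classifier hidden width are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class NonImageBranch(nn.Module):
    """Vanilla Transformer encoder over per-frame non-image features
    (4 bounding-box coordinates + 1 ego-vehicle speed per frame)."""
    def __init__(self, feat_dim=5, d_model=256, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=384, dropout=0.1,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, x):                # x: (batch, 16, 5)
        z = self.encoder(self.embed(x))  # (batch, 16, 256)
        return z.mean(dim=1)             # pooled sequence representation

class IntentionClassifier(nn.Module):
    """Two-layer FFNN with ReLU and dropout 0.5, fed the concatenated
    representations of the two branches."""
    def __init__(self, image_dim=768, nonimage_dim=256, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(image_dim + nonimage_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(hidden, 2),        # crossing / not-crossing logits
        )

    def forward(self, h_image, h_nonimage):
        return self.net(torch.cat([h_image, h_nonimage], dim=-1))
```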
C. Training

In order to train the proposed model, we need a large dataset of video frames that capture the pedestrian's behavior in various scenarios and the corresponding non-image features (labels) such as the bounding-box coordinates and the ego-vehicle speed. Using such a dataset, the model is trained end-to-end to minimize a loss function that represents the difference between the model's predicted output, i.e., the estimated pedestrian's intention, and the true output, i.e., the ground truth intentions. The dataset and the loss function are discussed next.

1) Dataset: We train the model using the Pedestrian Intention Estimation (PIE) dataset [5]. This dataset includes 1842 pedestrians, 1322 of which did not have the intention of crossing while the rest did. We also followed the same set split for the training, validation, and test purposes, as proposed in [15]. The tracking length for each pedestrian is 16 frames, whereas the last frame is 1-2 sec (30-60 frames) prior to the crossing/not-crossing action.

2) Data Preprocessing: Since we are using multiple features with different ranges, data normalization is a necessity. We apply z-standardization on the cropped images using the mean and standard deviation of ImageNet, which is a large-scale image dataset widely used in computer vision applications [28], and also on the ego-vehicle speed using its mean and standard deviation from the training set.

The ego-vehicle speed is standardized as

$$\upsilon_{standardized} = \frac{\upsilon - \mu_{speed}}{\sigma_{speed}} \quad (1)$$

The bounding-box coordinates are normalized as follows:

$$x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (2)$$

$$y_{normalized} = \frac{y - y_{min}}{y_{max} - y_{min}} \quad (3)$$

where $x_{min} = y_{min} = 0$, $x_{max} = 1920$, and $y_{max} = 1080$.
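A short sketch of this preprocessing, assuming PyTorch tensors and the commonly used ImageNet channel statistics; the helper names are illustrative, not taken from the paper.

```python
import torch

# ImageNet channel statistics commonly used for z-standardization [28].
IMAGENET_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 1, 3, 1, 1)
IMAGENET_STD  = torch.tensor([0.229, 0.224, 0.225]).view(1, 1, 3, 1, 1)

def standardize_images(frames):
    """frames: (batch, 16, 3, 224, 224) cropped images scaled to [0, 1]."""
    return (frames - IMAGENET_MEAN) / IMAGENET_STD

def standardize_speed(speed, mu_speed, sigma_speed):
    """Equation (1): z-standardize ego-vehicle speed with training-set stats."""
    return (speed - mu_speed) / sigma_speed

def normalize_bbox(bbox, width=1920.0, height=1080.0):
    """Equations (2)-(3): min-max normalize (x1, y1, x2, y2) coordinates."""
    scale = torch.tensor([width, height, width, height])
    return bbox / scale
```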
3) The Loss Function: The conventional cross-entropy loss ($L_{CE}$) is used for our task to measure the difference between the predicted probability distribution of the model and the true probability distribution of the labels. Generally, the expression of $L_{CE}$ can be formulated as follows:

$$L_{CE}(\hat{y}, y) = -\sum_{i}^{n} y_i \log(p_i) \quad (4)$$

$$p_i = \frac{\exp(x_i)}{\sum_{j}^{n} \exp(x_j)} \quad (5)$$

where $n$ is the number of classes, $y_i$ is the ground truth label, $p_i$ is the softmax probability of the $i$th class, and $x$ is the logits output vector of the model.

In the problem under consideration, we need to account for the imbalance between the two classes in the dataset. For this purpose, we use a weighted cross-entropy loss function ($L_{Weighted\text{-}CE}(\hat{y}, y)$) instead of the general cross-entropy loss expression. In $L_{Weighted\text{-}CE}(\hat{y}, y)$, we assign different weights ($W_C$ for the crossing class (C), and $W_{NC}$ for the non-crossing class (NC)) to make the less-represented class contribute equally, during the learning process, to the loss function as the major class. As a result, the weighted cross-entropy loss function is formulated as:

$$L_{Weighted\text{-}CE}(\hat{y}, y) = W_C L_C(\hat{y}, y_C) + W_{NC} L_{NC}(\hat{y}, y_{NC}) \quad (6)$$

where
$L_C$ denotes the loss when the target class is crossing,
$L_{NC}$ denotes the loss when the target class is not-crossing,
$y_C$ is the crossing ground truth,
$y_{NC}$ is the not-crossing ground truth,
$W_C = \frac{N_{NC}}{N_C + N_{NC}}$,
$W_{NC} = \frac{N_C}{N_C + N_{NC}}$,
$N_C$ denotes the number of samples with a crossing label, and
$N_{NC}$ denotes the number of samples with a not-crossing label.
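These weights can be plugged directly into a standard weighted cross-entropy loss. The sketch below uses the PIE class counts quoted above; the class ordering is an assumption.

```python
import torch
import torch.nn as nn

# Class counts from the PIE dataset split described above.
N_NC, N_C = 1322.0, 520.0           # not-crossing / crossing samples

# Equation (6) weights: the minority (crossing) class gets the larger weight.
W_C  = N_NC / (N_C + N_NC)
W_NC = N_C  / (N_C + N_NC)

# Class order assumed as [not-crossing, crossing].
criterion = nn.CrossEntropyLoss(weight=torch.tensor([W_NC, W_C]))

logits = torch.randn(14, 2)          # batch of 14 samples, 2 classes
labels = torch.randint(0, 2, (14,))  # ground-truth intentions
loss = criterion(logits, labels)
```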
4) Metrics: We report our performance evaluation results using the $f1_{score}$ and the area under the Receiver Operating Characteristics (ROC) curve ($AUC$). These two metrics suit imbalanced data best since reporting the accuracy in such cases can be misleading [29]. The f1-score is defined as

$$f1_{score} = 2 \times \frac{Precision \times Recall}{Precision + Recall} \quad (7)$$

where $Precision$ is defined as the ratio of the number of true positives to the sum of true positives and false positives. $Precision$ represents the ability of the classifier to avoid false positive predictions. $Recall$ is defined as the ratio of the number of true positives to the sum of true positives and false negatives. $Recall$ represents the ability of the classifier to find all positive instances.

The ROC curve is a plot of the false positive rate versus the true positive rate at multiple thresholds. The AUC, which is the area under the ROC curve, is a good indicator of a classifier's performance. The higher the AUC of a classifier, the more reliable it is.
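Both metrics can be computed with scikit-learn, as sketched below, assuming the predicted probability of the crossing class is used as the score for the AUC.

```python
from sklearn.metrics import f1_score, roc_auc_score

# y_true: ground-truth intentions (1 = crossing, 0 = not-crossing)
# y_prob: predicted probability of the crossing class for each sample
y_true = [1, 0, 0, 1, 1, 0]
y_prob = [0.91, 0.22, 0.40, 0.65, 0.78, 0.05]
y_pred = [int(p >= 0.5) for p in y_prob]

print("F1-score:", f1_score(y_true, y_pred))  # equation (7)
print("AUC:", roc_auc_score(y_true, y_prob))  # area under the ROC curve
```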
The detailed procedure of the proposed approach is illustrated in Algorithm 1.

III. PERFORMANCE EVALUATION

In this section, we present an evaluation of the performance of the proposed approach by analyzing various metrics and comparing them against state-of-the-art results. As mentioned earlier, we adopt the $f1_{score}$ and the area under the ROC curve ($AUC$). The state-of-the-art results that we consider in the comparison are the CAPformer [18] and the PCPA (Pedestrian Crossing Prediction with Attention) [15]. Both approaches report the highest $f1_{score}$ and $AUC$ on the same dataset [5].

For the setup of our simulation environment, we tried multiple hyper-parameters to achieve the best possible performance. We list the hyper-parameters that we used to get the best performance of our model in TABLE I.
TABLE I
SIMULATION PARAMETERS

Vanilla Transformer Encoder
  Embedding dimension (d_model): 256
  Number of encoder layers: 2
  Number of heads: 4
  Dropout rate: 0.1
  MLP hidden layer dimension: 384

Video Masked Autoencoder
  Last hidden state size (h_state): 768
  Input image size: 224 x 224
  Pretrained weights: Kinetics-400

Fusion Parameters
  Dropout rate: 0.5
  Activation fn.: ReLU

General Parameters
  Sequence length (N): 16
  Learning rate: 1e-3
  Learning rate scheduler: Cosine Decay
  Batch size (B): 14
  Epochs: 20
  Cropping strategy: Local box warp [18]
  Optimizer: AdamW
  Weight decay: 0.05

Algorithm 1 Transformer-based Pedestrian Intention Estimation
1: procedure DATA PREPROCESSING
2:   Input: Raw Data
3:   Crop pedestrian images into (224 x 224) images using bounding-box coordinates.
4:   Standardize ego-vehicle speed using z-standardization.
5:   Normalize both bounding-box coordinates and cropped images using min-max normalization.
6: end procedure
7: procedure DATA SPLIT
8:   Input: Processed Data.
9:   Set batchSize.
10:  Define trainSet: training dataset.
11:  Define validationSet: validation dataset.
12:  Define testSet: testing dataset.
13:  Set dataSplits to [trainSet, validationSet, testSet].
14:  for dataSplit in dataSplits do
15:    Create a dataloader for dataSplit with a batch size of batchSize.
16:  end for
17: end procedure
18: procedure TRAINING
19:  Input: trainLoader and validationLoader
20:  Set hyperparameters.
21:  for each training epoch do
22:    Set predictions to the output of our hybrid model.
23:    Calculate train metrics, namely, weighted CE-loss, accuracy, precision, recall, and f1-score.
24:    Update the model weights with the optimizer step.
25:  end for
26:  Calculate validation metrics.
27:  if validation loss < previous best model loss then
28:    Set best model to the current model.
29:  else
30:    Change hyperparameters.
31:    Repeat Training.
32:  end if
33:  Export best model weights for future inference.
34: end procedure
35: procedure TESTING
36:  Input: testLoader
37:  Load best model.
38:  Set test predictions to the output of the best model.
39:  Calculate test metrics.
40:  Output: test metrics
41: end procedure
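A condensed PyTorch-style sketch of the TRAINING procedure in Algorithm 1 is shown below, using the hyper-parameters of TABLE I. The model, data loaders, and loss function are placeholders for the components described in Section II, so this is a sketch of the loop structure rather than the authors' code.

```python
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def train(model, train_loader, val_loader, criterion, epochs=20, lr=1e-3,
          weight_decay=0.05, device="cuda"):
    """Sketch of Algorithm 1 (TRAINING): weighted-CE training with AdamW,
    cosine decay, and best-model selection on validation loss."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr,
                                  weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    best_val_loss, best_state = float("inf"), None

    for epoch in range(epochs):
        model.train()
        for frames, feats, labels in train_loader:       # image / non-image inputs
            logits = model(frames.to(device), feats.to(device))
            loss = criterion(logits, labels.to(device))  # weighted CE, eq. (6)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for frames, feats, labels in val_loader:
                logits = model(frames.to(device), feats.to(device))
                val_loss += criterion(logits, labels.to(device)).item()
        if val_loss < best_val_loss:                      # keep the best model
            best_val_loss, best_state = val_loss, model.state_dict()

    return best_state
```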
We report the results of the performance evaluation of our technique based on the selected aforementioned metrics on the test set of the PIE dataset in comparison with the other selected baseline models in the literature. TABLE II shows the results.
TABLE II
RESULTS OF DIFFERENT MODELS ON THE TEST SET. I DENOTES THE PEDESTRIAN CROPPED IMAGE SEQUENCE, P DENOTES THE PEDESTRIAN'S POSE, BB DENOTES THE PEDESTRIAN'S BOUNDING BOX AND S DENOTES THE EGO-VEHICLE SPEED.

Model            Backbone                        Features      Params   F1-score   AUC
PCPA [15]        C3D                             I, P, BB, S   31M      0.770      0.86
CAPformer [18]   TimeSformer [19]                I, BB, S      123M     0.779      0.853
CAPformer [18]   RubiksNet                       I, BB, S      8M       0.749      0.839
Our model        Video Masked AutoEncoder [20]   I, BB, S      89M      0.843      0.914

As illustrated in TABLE II, our proposed model outperforms both the CAPformer and the PCPA models. This is due to its ability to capture more complex spatio-temporal patterns in video data. The encoder of the upper branch was able to learn the important features of the sequence. This allowed our model to learn representations that are more compact and informative compared to other models.

Notably, our model is able to achieve this higher performance with a moderate-size GPU (RTX 2080 Ti 11GB, compared to the A100 24GB GPU used in [18]). This was achieved by using the 8-bit version of AdamW rather than the conventional version of AdamW, which can save up to 75% of the GPU memory utilized by the optimizer.
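One common way to obtain an 8-bit AdamW optimizer is the bitsandbytes library; the paper does not name the implementation it used, so the snippet below is only an assumed illustration.

```python
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(1024, 2)  # placeholder for the hybrid model of Section II

# Drop-in replacement for torch.optim.AdamW that stores optimizer state in
# 8-bit, cutting the GPU memory consumed by the optimizer.
optimizer = bnb.optim.AdamW8bit(model.parameters(), lr=1e-3, weight_decay=0.05)
```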
Our proposed model can, therefore, pave the way for future transformer-based models for similar scenarios with moderate hardware capabilities while still benefiting from the transformer's parallelization and long-dependencies attention.

IV. CONCLUSION

The area of autonomous vehicles research is currently thriving due to the large advances in the areas of hardware, such as those related to sensing and video recording, and software, such as artificial intelligence techniques. There is a crucial need to enable the AVs, as machines, to accurately comprehend the behavior of road users. One of the major requirements in this regard is pedestrian intention prediction, especially as it relates to road-crossing. In this study, we introduced a novel intention prediction model architecture that enables the AV to predict the intention of pedestrians as to whether they will cross the street. The model is based on fusing the transformer-processed data, namely, the non-image data streams with the image data, into a classifier that then produces the required pedestrian intention prediction decisions. This is done while optimizing the utilized computation resources, thus avoiding the need to use highly sophisticated computing resources to reach proper conclusions. Experimental results show that the proposed technique produces significantly better results than those of leading models from the literature. These results pave the way towards expanding this architecture to include other formations of the input data.

REFERENCES

[1] Mahir Gulzar, Yar Muhammad, and Naveed Muhammad, "A survey on motion prediction of pedestrians and vehicles for autonomous driving," IEEE Access, 2021.
[2] Khaled Saleh, "Pedestrian trajectory prediction for real-time autonomous systems via context-augmented transformer networks," Sensors, vol. 22, no. 19, pp. 7495, 2022.
[3] Sirin Haddad, Meiqing Wu, He Wei, and Siew Kei Lam, "Situation-aware pedestrian trajectory prediction with spatio-temporal attention model," arXiv preprint arXiv:1902.05437, 2019.
[4] Suresh Kumaar Jayaraman, Lionel P. Robert, X. Jessie Yang, and Dawn M. Tilbury, "Multimodal hybrid pedestrian: A hybrid automaton model of urban pedestrian behavior for automated driving applications," IEEE Access, vol. 9, pp. 27708–27722, 2021.
[5] Amir Rasouli, Iuliia Kotseruba, Toni Kunic, and John K. Tsotsos, "PIE: A large-scale dataset and models for pedestrian intention estimation and trajectory prediction," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 6262–6271.
[6] Ajay Shrestha and Ausif Mahmood, "Review of deep learning algorithms and architectures," IEEE Access, vol. 7, pp. 53040–53065, 2019.
[7] Amir Rasouli, Iuliia Kotseruba, and John K. Tsotsos, "Are they going to cross? A benchmark dataset and baseline for pedestrian crosswalk behavior," in Proceedings of the IEEE International Conference on Computer Vision Workshops, 2017, pp. 206–213.
[8] Raúl Quintero Mínguez, Ignacio Parra Alonso, David Fernández-Llorca, and Miguel Ángel Sotelo, "Pedestrian path, pose, and intention prediction through Gaussian process dynamical models and pedestrian activity recognition," IEEE Transactions on Intelligent Transportation Systems, vol. 20, no. 5, pp. 1803–1814, 2019.
[9] Khaled Saleh, Mohammed Hossny, and Saeid Nahavandi, "Intent prediction of pedestrians via motion trajectories using stacked recurrent neural networks," IEEE Transactions on Intelligent Vehicles, vol. 3, no. 4, pp. 414–424, 2018.
[10] Dimitrios Varytimidis, Fernando Alonso-Fernandez, Boris Duran, and Cristofer Englund, "Action and intention recognition of pedestrians in urban traffic," in 2018 14th International Conference on Signal-Image Technology and Internet-Based Systems (SITIS), 2018, pp. 676–682.
[11] Neha Sharma, Chhavi Dhiman, and S. Indu, "Pedestrian intention prediction for autonomous vehicles: A comprehensive survey," Neurocomputing, vol. 508, pp. 120–152, Oct. 2022.
[12] Tharindu Fernando, Simon Denman, Sridha Sridharan, and Clinton Fookes, "Soft + Hardwired attention: An LSTM framework for human trajectory prediction and abnormal event detection," Neural Networks, vol. 108, pp. 466–478, Dec. 2018.
[13] Yi Fang, Yize Li, Asam Ahmed, and Siming You, "Development, economics and global warming potential of lignocellulose biorefinery," in Biomass, Biofuels, Biochemicals, pp. 1–13. Elsevier, Waltham, MA, USA, Jan. 2021.
[14] Amir Rasouli, Iuliia Kotseruba, and John K. Tsotsos, "Understanding pedestrian behavior in complex traffic scenes," IEEE Transactions on Intelligent Vehicles, vol. 3, no. 1, pp. 61–70, 2018.
[15] Iuliia Kotseruba, Amir Rasouli, and John K. Tsotsos, "Benchmark for evaluating pedestrian action prediction," in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2021, pp. 1258–1268.
[16] Cunjun Yu, Xiao Ma, Jiawei Ren, Haiyu Zhao, and Shuai Yi, "Spatio-temporal graph transformer networks for pedestrian trajectory prediction," in European Conference on Computer Vision. Springer, 2020, pp. 507–523.
[17] Francesco Giuliari, Irtiza Hasan, Marco Cristani, and Fabio Galasso, "Transformer networks for trajectory forecasting," arXiv, Mar. 2020.
[18] Javier Lorenzo, Ignacio Parra Alonso, Rubén Izquierdo, Augusto Luis Ballardini, Álvaro Hernández Saz, David Fernández Llorca, and Miguel Ángel Sotelo, "CAPformer: Pedestrian crossing action prediction using transformer," Sensors, vol. 21, no. 17, pp. 5694, 2021.
[19] Gedas Bertasius, Heng Wang, and Lorenzo Torresani, "Is space-time attention all you need for video understanding?," arXiv, Feb. 2021.
[20] Zhan Tong, Yibing Song, Jue Wang, and Limin Wang, "VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training," arXiv preprint arXiv:2203.12602, 2022.
[21] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[22] Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, and Xinggang Wang, "Unleashing vanilla vision transformer with masked image modeling for object detection," arXiv preprint arXiv:2204.02964, 2022.
[23] Lina Achaji, Julien Moreau, Thibault Fouqueray, Francois Aioun, and François Charpillet, "Is attention to bounding boxes all you need for pedestrian action prediction?," in 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2022, pp. 895–902.
[24] Yang Liu, Yao Zhang, Yixin Wang, Feng Hou, Jin Yuan, Jiang Tian, Yang Zhang, Zhongchao Shi, Jianping Fan, and Zhiqiang He, "A survey of visual transformers," arXiv preprint arXiv:2111.06091, 2021.
[25] J. Lorenzo, I. Parra, and M. A. Sotelo, "IntFormer: Predicting pedestrian intention with the aid of the transformer architecture," arXiv preprint arXiv:2105.08647, 2021.
[26] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby, "An image is worth 16x16 words: Transformers for image recognition at scale," arXiv, Oct. 2020.
[27] Will Kay, Joao Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, et al., "The Kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[28] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[29] László A. Jeni, Jeffrey F. Cohn, and Fernando De La Torre, "Facing imbalanced data–recommendations for the use of performance metrics," in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction. IEEE, 2013, pp. 245–251.
