Vehicle Detection and Classification Without Residual Calculation: Accelerating HEVC Image Decoding With Random Perturbation Injection
Abstract—In the field of video analytics, particularly traffic surveillance, there is a growing need for efficient and effective methods for processing and understanding video data. Traditional full video decoding techniques can be computationally intensive and time-consuming, leading researchers to explore alternative approaches in the compressed domain. This study introduces a novel random perturbation-based compressed domain method for reconstructing images from High Efficiency Video Coding (HEVC) bitstreams, specifically designed for traffic surveillance applications. To the best of our knowledge, our method is the first to propose substituting random perturbations for residual values, thereby creating a condensed representation of the original image while retaining information relevant to video understanding tasks, particularly focusing on vehicle detection and classification as key use cases.

By not using any residual data, our proposed method significantly reduces the amount of data needed in the image reconstruction process, allowing for more efficient storage and transmission of information. This is particularly important given the vast amount of video data involved in surveillance applications. Applied to the public BIT-Vehicle dataset, we demonstrate a significant increase in reconstruction speed compared to the traditional full decoding approach, with our proposed random perturbation-based method being approximately 56% faster than the pixel domain method. Additionally, we achieve a detection accuracy of 99.9%, on par with the pixel domain method, and a classification accuracy of 96.84%, only 0.98% lower than the pixel domain method. Furthermore, we showcase a significant reduction in data size, leading to more efficient storage and transmission. Our research establishes the potential of compressed domain methods in traffic surveillance applications, where speed and data size are critical factors. The study's findings can be extended to other object detection tasks, such as pedestrian detection, and future work may investigate the integration of compressed and pixel domain information, as well as the extension of these methods to the full video decoding process, encompassing both intra and inter encoded bitstreams.

Index Terms—H.265/HEVC, Compressed Domain Video Analytics, Vehicle Classification, Video Surveillance, Real-time Video Analysis

1Signal Processing for Computational Intelligence Research Group, Informatics Institute, İstanbul Technical University, Maslak, Turkey ({beratoglu, toreyin}@itu.edu.tr). This work was supported by the Scientific and Technological Research Council of Turkey (TUBITAK) under grant number 121E378.

I. INTRODUCTION

In recent years, the rapid growth of video data and the increasing demand for efficient video analytics have led researchers to seek new methods for analyzing and processing video streams. Traditional video processing techniques require decoding the entire video bitstream, which can be computationally intensive and time-consuming. In traffic surveillance applications, where real-time video understanding is crucial, the computational overhead associated with full video decoding can be a significant bottleneck. Consequently, researchers have begun to explore alternative approaches that leverage the compressed domain to reduce processing time while maintaining the effectiveness of video understanding tasks [1], [2], [3]. By directly analyzing the compressed video bitstream, it is possible to extract information relevant to video understanding tasks without the need for full decoding.

In this study, we introduce a novel method for reconstructing images from HEVC bitstreams by injecting random perturbations as a substitute for residual values, significantly speeding up the reconstruction process compared to standard intra decoding. To the best of our knowledge, our method is the first to propose substituting residual values rather than calculating them, thereby creating a condensed representation of the original image while retaining information pertinent to video understanding tasks, with vehicle detection and classification as key use cases. By operating directly in the compressed domain, our method avoids the computational overhead associated with full video decoding, leading to a more efficient and effective solution for traffic monitoring and management.

In video compression standards such as HEVC, the encoded bitstream holds a substantial amount of data related to the prediction error of image samples, known as residuals. For I-frames, these residuals make up about 85-90% of the total data in the bitstream [4]. Furthermore, residual coding accounts for an average of 77% and 84% of the total bits for dynamic continuous and discrete video textures, respectively [5]. By not using any residual data, our proposed method significantly reduces the amount of data needed in the image reconstruction process, allowing for more efficient storage and transmission of information. This is particularly important when considering the vast amount of video data involved in surveillance applications.

To evaluate the performance of our proposed method, we conduct experiments on the public BIT-Vehicle dataset, a large-scale dataset comprising diverse vehicle types and imaging conditions. Our results demonstrate that the proposed method is able to reconstruct images approximately 56% faster than the pixel domain method, while maintaining a high level of detection and classification accuracy. In particular, our method achieves a detection accuracy of 99.9%, on par with
the pixel domain method, and a classification accuracy of 96.84%, only 0.98% lower than the pixel domain approach. These results highlight the potential of our compressed domain method for traffic surveillance applications, where both speed and data size are critical factors.

Additionally, the increasing emphasis on data privacy and the need to comply with regulations such as the EU General Data Protection Regulation (GDPR) and the California Privacy Rights Act (CPRA) make data minimization a crucial consideration in video analytics. These regulations mandate that only the data necessary to fulfill a given purpose be collected. A recent study presented a method to reduce the amount of personal data needed for machine learning predictions by removing or generalizing some input features of the runtime data using knowledge distillation approaches [6]. Furthermore, the work by He et al. introduces a lightweight image encryption scheme using compressive sensing and data hiding to enhance privacy and data security in smart city applications, reinforcing the importance of advanced encryption and access control mechanisms in ensuring data security [7]. In the context of our proposed method, operating directly in the compressed domain and significantly reducing the amount of data needed for image reconstruction addresses the data minimization requirement set out in these regulations. By minimizing the data used for video understanding tasks, our method not only enhances processing efficiency but also helps organizations comply with privacy regulations.

By showcasing the potential of compressed domain methods for video understanding tasks in traffic surveillance applications, this study contributes to the growing body of research aimed at overcoming performance bottlenecks in video analytics. The faster reconstruction time and reduced data size associated with these methods make them a promising option for applications where speed and data size are important considerations. Further research into the potential applications and improvements of these methods could lead to significant advancements in the field of video analytics.

The remainder of this paper is organized as follows: Section II provides an overview of related work in object detection and video compression. Section III discusses the basic blocks of HEVC encoding and decoding, with a focus on intra prediction. Section IV presents our proposed approach for object detection and classification in compressed domain videos. Section V provides experimental results, and Section VI concludes the paper.

II. RELATED WORKS

In this section, we review the existing literature on vehicle classification in both the pixel and compressed domains, with a focus on methods using the BIT-Vehicle dataset. We briefly discuss various techniques and recent advancements while highlighting how our approach differs from existing methods.

A. Vehicle Classification in Pixel Domain

Vehicle classification in the pixel domain has been a popular research topic over the years, with numerous review papers providing comprehensive overviews of various techniques and methods. Notable review papers on this topic include those by Yang and Pun-Cheng [8], Wang et al. [9], and Zou et al. [10], which discuss the state of the art in vehicle classification, as well as the challenges and future research directions.

In this section, we focus on methods that utilize the BIT-Vehicle dataset, as this dataset allows us to directly compare our results with these works. We provide an overview of five representative vehicle classification methods based on the BIT-Vehicle dataset.

Dong et al. [11] proposed a vehicle type classification method using a semi-supervised convolutional neural network on vehicle frontal-view images. They introduced sparse Laplacian filter learning to obtain the filters of the network from large amounts of unlabeled data and trained the network on the challenging BIT-Vehicle dataset. The method demonstrated the effectiveness of deep learning for vehicle classification in complex scenes.

Roecker et al. [12] proposed a convolutional neural network model for vehicle type classification using low-resolution images from a frontal perspective. They trained the model on a subset of the BIT-Vehicle dataset and achieved an accuracy of 93.90%, showing the model to be discriminative and capable of generalizing the patterns of the vehicle type classification task.

Sang et al. [13] proposed a new vehicle detection model, called YOLOv2 Vehicle, based on YOLOv2. They used the k-means++ clustering algorithm to cluster vehicle bounding boxes on the training dataset, improved the loss calculation method for bounding box dimensions, and adopted a multi-layer feature fusion strategy. The model achieved a mean Average Precision (mAP) of 94.78% on the BIT-Vehicle validation dataset.

Wu et al. [14] proposed a multi-scale vehicle detection method by improving YOLOv2 to address the foreground-background class imbalance and varying vehicle sizes in a scene. They introduced a new anchor box generation method called Rk-means++ and incorporated Focal Loss into YOLOv2 for vehicle detection. The method demonstrated better performance on vehicle localization and recognition on the public BIT-Vehicle dataset compared to other existing methods.

Taheri Tajar et al. [15] developed a lightweight real-time vehicle detection model based on the Tiny-YOLOv3 network. They pruned and simplified the network and trained it on the BIT-Vehicle dataset, achieving an mAP of 95.05% and a detection speed of 17 fps, about two times faster than the original Tiny-YOLOv3 network.

In our work, we adopt the YOLOv7 framework [16] as the basis for our vehicle classification method. We focus on achieving accuracy comparable to pixel domain methods while operating in the compressed domain. By utilizing the strengths of YOLOv7 and adapting it to work with HEVC intra features, we aim to develop a computationally efficient vehicle classification method that maintains high accuracy.

B. Related Works in Compressed Domain

Recent years have witnessed a growing interest in developing object detection and classification methods in the compressed
domain. In this section, we review some of the most relevant works and discuss how our approach differs from them.

Donghai Zhai et al. [17] provided a comprehensive overview of object detection methods in the compressed domain across various video compression standards, including MPEG-2, H.264, and HEVC. They highlighted different ways of utilizing motion vector information for object detection and analyzed the techniques under the various compression standards. Among the many works presented in their review, we have chosen the ones that focus on the HEVC compressed domain for a more detailed comparison with our approach.

Zhao et al. [18] proposed a real-time moving object segmentation and classification method for surveillance videos using HEVC compressed domain features. Their approach only classified objects as persons or vehicles, while our method classifies vehicles into six specific types.

Chan et al. [19] showed that tuning DNNs with compressed data enhances detection accuracy in vehicle detection systems. Their findings confirm that DNN performance is stable even at high compression ratios, up to 160:1, making a significant case for using compressed data in automated driving applications without loss of critical information.

Deguerre et al. explored the impact of video compression on traffic flow rate estimation using deep learning from MPEG-4 Part 2 compressed video streams [20]. Their findings underscore the potential for deep learning models to effectively utilize compressed data, enhancing the efficiency of traffic management systems.

In the work by Cai et al. [21], the authors propose a new video coding strategy in HEVC tailored for object detection, focusing on task-specific bit allocation and the influence of each pixel on detection algorithms. In contrast, our approach enhances the existing HEVC framework by introducing random perturbations for image reconstruction, aiming to condense the original image efficiently while retaining the information essential for vehicle detection and classification. This represents an optimization of the current decoding algorithm, prioritizing speed and data efficiency.

Chen et al. [22] introduced a fast object detection method in the HEVC intra compressed domain. Their method used partitioning depths, prediction modes, and residuals for object detection, whereas our approach omits residuals and achieves good results with less computational demand.

Alizadeh and Sharifkhani [23] presented a moving object detection method in the H.265/HEVC compressed domain based on a conditional random field (CRF) model. Their approach extracts and analyzes block-specific data such as motion vectors (MVs), partitioning modes, and bit consumption from the compressed bitstream for object detection.

Feng et al. [24] proposed a fast framework for semantic video segmentation, named TapLab, which utilized motion vectors and residuals from compressed videos. Unlike their method, we focus on intra features and do not rely on motion vectors.

Yang et al. [25] focused on improving the texture of compressed chrominance components using a luminance-guided chrominance enhancement network and online learning. Their approach, which optimizes both encoder and decoder performance, emphasizes enhancing image quality with a low-complexity, high-performance design. Unlike their method, which centers on chrominance texture enhancement and encoder-decoder optimization, our approach is dedicated to efficiently reconstructing images for video analytics, specifically for vehicle detection and classification.

Choi and Bajic [26] presented a human detection method based on HEVC intra coding syntax elements, including block size, intra prediction modes, and transform coefficient levels. Their approach did not require full bitstream decoding but focused on human detection rather than vehicle classification.

Wang et al. [27] developed a highway vehicle counting method in the compressed domain using low-level features extracted from coding-related metadata. Their method is competitive with pixel-domain approaches in terms of computational cost but focuses on counting vehicles rather than classifying them into distinct types.

Our method differs from these works in several ways. We classify vehicles into six specific types, providing a more detailed classification for intelligent transportation applications. Furthermore, we are the first to suggest using random perturbations to reconstruct a frame without employing residual data, which makes our method less computationally demanding. While most of these works rely on motion vectors, our approach exploits the potential of intra features. Since a video consists of intra and inter frames, incorporating motion vectors in future work could further enhance our method. By using the state-of-the-art YOLOv7, we demonstrate accuracy close to that of pixel domain methods, showcasing the effectiveness of our approach.

III. HIGH EFFICIENCY VIDEO CODING (HEVC)

High Efficiency Video Coding (HEVC) is a video compression standard that was developed jointly by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. It is designed to achieve higher compression efficiency than its predecessor, the H.264/MPEG-4 AVC standard, by introducing new tools and techniques such as larger block sizes, more prediction modes, and more efficient entropy coding [28], [29].

The HEVC compression algorithm processes video data using a hierarchical organization of blocks. The encoding process begins with partitioning a frame into Coding Tree Units (CTUs), which are further partitioned into Coding Units (CUs), Transform Units (TUs), and Prediction Units (PUs). The prediction process can be either inter or intra. Intra prediction, also known as intra-frame prediction, is a technique to remove spatial correlation: it uses information from previously coded blocks within the same frame to predict the content of the current block. Inter prediction, on the other hand, removes temporal correlation: it uses information from previously coded frames to predict the content of the current frame [30]. Our method is applied to intra-predicted frames.
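To fix ideas, the quadtree recursion behind this block hierarchy can be sketched as follows. This is an illustrative toy model with the split decision stubbed out; in a real encoder, split decisions are driven by rate-distortion optimization, and the names used here are our own.

```python
from dataclasses import dataclass, field

@dataclass
class CodingUnit:
    x: int
    y: int
    size: int
    children: list = field(default_factory=list)

def split_ctu(x, y, size, must_split, min_size=8):
    """Recursively partition a CTU (e.g., 64x64) into CUs down to min_size.
    must_split(x, y, size) stands in for the encoder's RD-based decision."""
    cu = CodingUnit(x, y, size)
    if size > min_size and must_split(x, y, size):
        half = size // 2
        cu.children = [split_ctu(x + dx, y + dy, half, must_split, min_size)
                       for dy in (0, half) for dx in (0, half)]
    return cu

# Toy decision rule: split every CU larger than 32x32.
tree = split_ctu(0, 0, 64, lambda x, y, s: s > 32)
```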
Fig. 1. Reconstructing CTUs with HEVC Standard Decoding Process and with Random Perturbation Based Residual Substitution.
In intra prediction, the prediction is formed using reference samples and prediction modes. Reference samples are blocks of image or video data used as a reference for predicting the values of other blocks. They are extracted at the boundary from the upper and left blocks adjacent to the current PU. When reference samples are not available, they can be generated by copying samples from the closest available references. If no reference samples are available at all, a nominal average sample value (typically 128) is used in their place. HEVC uses several intra prediction modes (Angular, DC, and Planar) to achieve better compression performance. All intra prediction modes use the same set of reference samples, and there are 35 prediction modes in total: 33 angular modes plus the DC and Planar modes [30].
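As a concrete illustration of the reference-sample availability rules just described, the sketch below assembles the above-row and left-column references for a PU. The function name, the fixed reference length, and the padding policy shown here are simplifications of our own; the HM reference software handles availability at a finer granularity.

```python
import numpy as np

def gather_reference_samples(above, left, pu_size=8, default=128):
    """Assemble the intra reference samples for a PU. `above` and `left` are
    1-D arrays of boundary pixels from the neighboring blocks, or None when
    that neighbor is unavailable. Missing references are padded by copying
    the closest available sample; if nothing is available, the nominal
    average value (128 for 8-bit content) is used instead."""
    if above is None and left is None:
        fill = np.full(2 * pu_size, default, dtype=np.uint8)
        return fill, fill.copy()
    if above is None:
        above = np.full(2 * pu_size, left[0], dtype=np.uint8)
    if left is None:
        left = np.full(2 * pu_size, above[0], dtype=np.uint8)
    return np.asarray(above, dtype=np.uint8), np.asarray(left, dtype=np.uint8)
```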
The HEVC standard employs the discrete cosine transform (DCT) and the discrete sine transform (DST) to encode TUs. The residuals, the differences between predictions and original pixel values, are held in the TUs. The block structure, prediction modes, and the quantized data are entropy coded and transmitted.

Decoding, illustrated in Fig. 1, begins with the entropy decoding block, which extracts syntax elements such as the partition structure, prediction modes, and residual data. The decoder generates the CTU prediction by employing the prediction modes and utilizing reference samples; the reference samples consist of neighboring pixels found within the image. The residual data is then added to the predicted CTU to generate the final CTUpx. Note that reconstructing CTUs with the HEVC standard decoding process and with estimated residuals is actually performed at the Coding Unit (CU) level; however, for simplicity and ease of understanding, we present the process in the context of CTUs.

1) Standard Reconstruction: Let CU(x, y) be the intensity value of a reconstructed Coding Unit of an image I at the spatial coordinates (x, y). The intensity value of the CU can be expressed as the sum of its prediction value, P(x, y), and its residual value, R(x, y), as shown in the following equation:

CU(x, y) = P(x, y) + R(x, y)
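To make the contrast concrete, the following minimal sketch juxtaposes the standard reconstruction rule with the proposed residual substitution. It is an illustration rather than the HM decoder: the zero-mean Gaussian form of the perturbation, parameterized by its standard deviation sigma as in Fig. 3, and all function names are our assumptions for exposition.

```python
import numpy as np

def reconstruct_standard(prediction, residual):
    """Standard HEVC rule: CU(x, y) = P(x, y) + R(x, y), clipped to 8-bit range."""
    return np.clip(prediction.astype(np.int16) + residual, 0, 255).astype(np.uint8)

def reconstruct_random_perturbation(prediction, sigma, rng):
    """Proposed substitution: skip residual decompression and add a random
    perturbation in place of R(x, y). A zero-mean Gaussian with standard
    deviation sigma is assumed here for illustration (cf. Fig. 3)."""
    perturbation = rng.normal(0.0, sigma, size=prediction.shape)
    return np.clip(prediction.astype(np.float64) + perturbation, 0, 255).astype(np.uint8)

# Example on a dummy 16x16 prediction block with sigma = 1.
rng = np.random.default_rng(0)
prediction = rng.integers(0, 256, size=(16, 16), dtype=np.uint8)
cu_rp = reconstruct_random_perturbation(prediction, sigma=1.0, rng=rng)
```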
Fig. 3. Effect of Varying Standard Deviations in Random Perturbation for Image Reconstruction (panels: Ipx, and Irp with σ = 1, 2, 3, 4, 5).
tasks. According to its authors, YOLOv7 outperforms previous models and has set a significant benchmark in the field.

The YOLOv7 family includes the YOLOv7-Tiny model, the smallest in the family with just over 6 million parameters. Despite its compact size, YOLOv7-Tiny achieves a validation AP of 35.2%, surpassing the performance of previous YOLO-Tiny models.

In our proposed method, we utilize the reconstructed images generated using random perturbations as input to the vehicle detection and classification component. These reconstructed images, referred to as Irp images, are used to train a model with the Darknet framework [32]. The trained model is then applied to the task of vehicle detection and classification within the HEVC compressed domain.

By leveraging the YOLOv7 object detector and the reconstructed images, our proposed method is able to efficiently and accurately detect and classify vehicles in the compressed domain, without the need for residual data.

V. EXPERIMENTAL RESULTS

This section presents the experimental results of our proposed method. We begin with an overview of the experimental setup, detailing the hardware and software configurations used in the experiments, and a description of the dataset. Next, we provide a comparison of time efficiency and accuracy with other relevant approaches. The results demonstrate the effectiveness of our method in achieving efficient and accurate vehicle detection and classification within the HEVC compressed domain.

A. The Experimental Setup

The experimental setup begins with obtaining HEVC bitstreams from the JPEG format images, Iorg, in the BIT database through intra encoding with the HEVC encoder [33]. Then, from these bitstreams, we generate the pixel domain images Ipx and the random perturbation images Irp. Note that the Ipx images are first encoded and then decoded from Iorg to ensure a fair comparison, as each source image is then subject to the same HEVC encoding distortion. To further compare with previous compressed domain approaches [34], [35], we also generate Ibp (block partition based) and Ipu (prediction unit based) images from the same bitstreams.

Next, we fine-tuned the YOLOv7-Tiny models using pretrained weights provided by [16], ensuring consistent hyperparameters such as learning rate, batch size, and number of epochs across all scenarios. Throughout the training process, which included over 250,000 batches, we monitored the models' performance on the validation set and selected the weights that achieved the highest mean Average Precision (mAP) to ensure robust generalization.
Fig. 4. Comparison of four different input images generated from the same HEVC bitstream and fed into their corresponding YOLOv7-Tiny networks for vehicle detection and classification.
In addition to these models, we trained a separate model on the original JPEG images, Iorg, to serve as a benchmark. To explore the impact of standard deviation variations in the random perturbations, we also trained distinct models for each specified standard deviation. Overall, 14 different models were trained for the vehicle classification task, and 4 models were dedicated to vehicle detection, enabling comprehensive analysis across varying conditions.

Figure 4 illustrates the four different input images generated from the bitstream and fed into the YOLOv7-Tiny networks. We use the following abbreviations for vehicle detection and classification based on the different image types:

• Vbp: using Ibp (block partition based) images.
• Vrp: using Irp (random perturbation) images.
• Vpu: using Ipu (prediction unit based) images.
• Vpx: using Ipx (pixel domain) images.

To conduct these experiments, the following hardware and software configurations are employed:

• Computer: a machine with an Intel(R) Core(TM) i9-9900X CPU, an NVIDIA GeForce GTX 1080 Ti GPU, and 48 GB of RAM, running a 64-bit Windows 11 operating system.
• Software: the reference software for the H.265/HEVC coding standard, HM (version 16.20), is used for both encoding and decoding. The "Main" profile is used for encoding, with 4:2:0 color encoding and a quantization parameter of 32 [33].
• Compiler: the Microsoft Visual Studio 2019 (v142) platform toolset is used to compile the reference software.

B. BIT Vehicle Dataset

We selected the BIT-Vehicle dataset [11] for our experiments due to its widespread use in previous research and the diverse set of images it offers for vehicle classification. The BIT-Vehicle dataset, provided by the Beijing Institute of Technology, comprises 9850 vehicle images featuring six types of vehicles: sedans, sport-utility vehicles (SUVs), microbuses, trucks, buses, and minivans. The dataset exhibits varying frequencies of vehicle types; specifically, the number of vehicles per class is as follows: 558 buses, 883 microbuses, 476 minivans, 5922 sedans, 1392 SUVs, and 822 trucks. These images were captured by road surveillance cameras and include both day and night scenes, as well as sunny days with no background noise, rain, snow, people, or other vehicle types.

To ensure a fair comparison with previous works, the dataset was divided into a training set and a validation set with a ratio of 8:2, containing 7880 and 1970 images, respectively. This ratio was also maintained for each vehicle type to ensure a balanced representation across classes. Among these images, approximately 1000 and 250 were nighttime images for training and validation, respectively.
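For reproducibility, such a per-class 8:2 partition can be expressed as a small stratified split routine. The sketch below is our own illustration (the authors' exact tooling is not described); the (filename, label) input format is hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(samples, train_ratio=0.8, seed=42):
    """Split (filename, class_label) pairs 8:2 per class, preserving the
    class distribution in both the training and validation sets."""
    by_class = defaultdict(list)
    for filename, label in samples:
        by_class[label].append(filename)
    rng = random.Random(seed)
    train, val = [], []
    for label, files in by_class.items():
        rng.shuffle(files)
        cut = int(len(files) * train_ratio)
        train += [(f, label) for f in files[:cut]]
        val += [(f, label) for f in files[cut:]]
    return train, val
```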
C. Reconstruction Time Comparison

We first measure the time taken by the different steps of the HEVC reconstruction process. Then, we calculate the image reconstruction time for each method. Finally, we measure the total elapsed time, including both the reconstruction and the inference time for vehicle detection and classification.

1) Measurement of Reconstruction Steps: The following processes are measured to determine the reconstruction time of each method:

• Entropy Decoding (ED): a common step for both methods.
• Intra Prediction (IP): a common step for both methods.
• Residual Decompression (RD): this step is skipped for Irp.
• Loop Filters (LF): a common step for both methods.
TABLE I
MEASUREMENT OF RECONSTRUCTION STEPS.

TABLE II
COMPARISON OF IMAGE RECONSTRUCTION TIMES.

Time (ms)   Irp     Ipx
Average     23.33   36.25
Minimum     18.00   26.00
Maximum     34.00   51.00

2) Comparison of Image Reconstruction Time: The elapsed time to reconstruct Irp, denoted as T(Irp), can be calculated using the following equation:

T(Irp) = T(ED) + T(IP) + T(LF)    (7)

Similarly, the elapsed time to reconstruct Ipx, denoted as T(Ipx), can be calculated as:

T(Ipx) = T(ED) + T(IP) + T(RD) + T(LF)    (8)

The results presented in Table II demonstrate that the proposed method has significantly faster reconstruction times than traditional full decoding. The time required to generate an image in the compressed domain using the random perturbation image (Irp) method averaged 23.33 ms, a 35.6% reduction compared to the pixel domain's 36.25 ms. These results indicate that the proposed method can substantially decrease the time required for image generation.
3) Comparison of Total Elapsed Time: The total time spent for both reconstruction and classification is presented in Table III. Vehicle classification using the YOLO convolutional neural network (CNN) takes approximately 2 ms for both the pixel and compressed domain methods, as it performs detection and classification simultaneously. The key difference between the two methods lies in the average reconstruction time. The compressed domain method (Vrp) is significantly faster, taking only 25.33 ms compared to the pixel domain method (Vpx), which takes 38.24 ms. This demonstrates the efficiency of our compressed domain method in reducing the overall time spent on reconstruction and classification tasks.

TABLE III
COMPARISON OF THE TOTAL TIME SPENT FOR RECONSTRUCTION AND CLASSIFICATION.

Method   Time (ms)
Vrp      25.33
Vpx      38.24
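As a quick sanity check, the averages in Tables II and III are mutually consistent with (7) and (8). The snippet below simply re-derives the reported percentages from those published averages; the roughly 2 ms inference time is the approximate figure quoted above.

```python
# Average reconstruction times from Table II (ms).
t_irp, t_ipx = 23.33, 36.25

# Skipping residual decompression (RD) removes T(Ipx) - T(Irp) per frame.
reduction = (t_ipx - t_irp) / t_ipx   # ~0.356 -> the 35.6% reduction
speedup = t_ipx / t_irp - 1.0         # ~0.554 -> "approximately 56% faster"

# Total pipeline times from Table III add ~2 ms of YOLOv7-Tiny inference.
inference = 2.0
v_rp_total = t_irp + inference        # ~25.33 ms, matching Table III
v_px_total = t_ipx + inference        # ~38.25 ms (38.24 ms in Table III)

print(f"reduction={reduction:.1%}, speedup={speedup:.1%}")
```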
D. Accuracy Comparison

1) Metrics: The results are evaluated using the F1-score, precision (the proportion of correct detections among all positive predictions), recall (the proportion of correct detections among all instances of the class in the dataset), average intersection over union (Avg. IoU), and mean average precision (mAP) at the IoU threshold of 0.50 (mAP@0.5).

mAP@0.5 is a metric commonly used to evaluate the performance of object detection algorithms. It is calculated as the mean of the Average Precision (AP) over the classes in the dataset, as shown in (9):

mAP = (1/n) Σ_{i=1}^{n} APi    (9)

where n is the number of classes in the dataset and APi is the Average Precision for class i.

To calculate AP, the algorithm's predictions are first sorted by their confidence scores. Then, AP is calculated as the area under the precision-recall curve with an IoU threshold of 0.50, as shown in (10):

AP = ( Σ_{k=1}^{n} P(k) · rel(k) ) / ( Σ_{k=1}^{n} rel(k) )    (10)

where P(k) is the precision at cut-off k, and rel(k) is a binary indicator of whether the prediction at cut-off k is a true positive, considering an IoU threshold of 0.50.
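Equation (10) translates directly into code. The sketch below is a simplified single-class version with hypothetical inputs: predictions are sorted by confidence, and the precision P(k) is accumulated only at cut-offs where rel(k) = 1.

```python
def average_precision(predictions):
    """AP per (10): predictions are (confidence, is_true_positive) pairs,
    where is_true_positive encodes matching to ground truth at IoU >= 0.50."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    tp, precision_sum = 0, 0.0
    for k, (_, rel) in enumerate(ranked, start=1):
        if rel:                      # rel(k) = 1: accumulate P(k) = tp / k
            tp += 1
            precision_sum += tp / k
    return precision_sum / tp if tp else 0.0

# Hypothetical example: five detections, three of which are true positives.
preds = [(0.95, True), (0.90, False), (0.80, True), (0.60, True), (0.40, False)]
print(average_precision(preds))      # (1/1 + 2/3 + 3/4) / 3 ~= 0.806
# mAP per (9) is then the mean of the per-class AP values.
```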
2) Vehicle Detection Accuracy: The BIT-Vehicle dataset is originally annotated with six different vehicle types. To measure vehicle detection performance, the dataset has been re-annotated, combining all vehicle types into a single "vehicle" class. This approach focuses solely on the detection of vehicles, which can enhance the model's accuracy by simplifying the task to detecting the presence of any vehicle rather than distinguishing between types.

In scenarios such as perimeter security or quick traffic counts, where the specific type of vehicle is less critical, prioritizing detection accuracy over classification can provide more reliable data. This is especially relevant in environments where rapid and accurate vehicle detection is paramount and the additional information provided by classification does not significantly alter the response or outcome.

The models are retrained using the four different image types. The obtained vehicle detection performance is presented in Table IV.

TABLE IV
VEHICLE DETECTION ACCURACY FOR DIFFERENT METHODS.

The results suggest that the random perturbation image reconstruction method (Vrp) achieves an Average Precision (AP) of 99.99% for vehicle detection, matching the performance of the pixel domain approach. Additionally, the BP and PU methods, which do not utilize random perturbations, also exhibit accuracy that closely approximates that of our proposed method. This highlights that all compressed domain methods provide commendable accuracy for vehicle detection. Given

TABLE VI
VEHICLE CLASSIFICATION ACCURACY FOR DIFFERENT METHODS.
Looking ahead, we plan to further our research in several key areas:

• While the BIT-Vehicle dataset proved adequate for initial validations, future studies will utilize datasets with greater variety and complexity to fully evaluate the robustness and adaptability of our approach under diverse real-world conditions.
• We will expand the application of our method to both intra and inter encoded bitstreams, exploring broader uses within video processing technologies.
• While our current approach eliminates residuals and approximates them using random perturbations, another intriguing direction for future research could involve transmitting a minimal amount of data that includes key statistical parameters of the original data, such as the mean and standard deviation. This approach aims to capture the data's variability with minimal additions to the transmitted data, potentially enhancing reconstruction fidelity.
• Additionally, we will explore the potential of our technique for other object detection tasks, such as pedestrian detection and license plate recognition.

In summary, our study contributes significantly to the field of compressed domain video analytics by demonstrating a viable method to minimize data and computational demands in traffic surveillance and potentially other related fields. The advancements presented not only underscore the capabilities of compressed domain methods but also highlight their growing importance in the efficient processing of large-scale video data.

REFERENCES

[1] R. V. Babu, M. Tom, and P. Wadekar, "A survey on compressed domain video analysis techniques," Multimedia Tools and Applications, vol. 75, no. 2, pp. 1043–1078, 2016.
[2] M. Javed, P. Nagabhushan, and B. B. Chaudhuri, "A review on document image analysis techniques directly in the compressed domain," Artificial Intelligence Review, pp. 1–30, 2017.
[3] R. Torfason, F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, "Towards image understanding from deep compression without decoding," in 6th International Conference on Learning Representations (ICLR), 2018. [Online]. Available: http://arxiv.org/abs/1803.06131
[4] J. Stankowski et al., "Analysis of compressed data stream content in HEVC video encoder," International Journal of Electronics and Telecommunications, vol. 61, pp. 121–127, 2015.
[5] A. V. Katsenou, M. Afonso, and D. R. Bull, "Study of compression statistics and prediction of rate-distortion curves for video texture," Signal Processing: Image Communication, vol. 101, 2022.
[6] A. Goldsteen, G. Ezov, R. Shmelkin, M. Moffie, and A. Farkash, "Data minimization for GDPR compliance in machine learning models," AI and Ethics, vol. 2, no. 3, pp. 477–491, 2022.
[7] X. He, L. Li, H. Peng, and F. Tong, "An efficient image privacy preservation scheme for smart city applications using compressive sensing and multi-level encryption," IEEE Transactions on Intelligent Transportation Systems, pp. 1–15, 2024.
[8] Z. Yang and L. S. Pun-Cheng, "Vehicle detection in intelligent transportation systems and its applications under varying environments: A review," Image and Vision Computing, vol. 69, pp. 143–154, 2018.
[9] Z. Wang, J. Zhan, C. Duan, X. Guan, P. Lu, and K. Yang, "A review of vehicle detection techniques for intelligent vehicles," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–21, 2022.
[10] Z. Zou, K. Chen, Z. Shi, Y. Guo, and J. Ye, "Object detection in 20 years: A survey," Proceedings of the IEEE, vol. 111, no. 3, pp. 257–276, 2023.
[11] Z. Dong, Y. Wu, M. Pei, and Y. Jia, "Vehicle type classification using a semisupervised convolutional neural network," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 4, pp. 2247–2256, 2015.
[12] M. N. Roecker, Y. M. G. Costa, J. L. R. Almeida, and G. H. G. Matsushita, "Automatic vehicle type classification with convolutional neural networks," in 2018 25th International Conference on Systems, Signals and Image Processing (IWSSIP), Maribor, Slovenia, 2018, pp. 1–5.
[13] J. Sang, Z. Wu, P. Guo, H. Hu, H. Xiang, Q. Zhang, and B. Cai, "An improved YOLOv2 for vehicle detection," Sensors, vol. 18, no. 12, p. 4272, 2018.
[14] Z. Wu, J. Sang, Q. Zhang, H. Xiang, B. Cai, and X. Xia, "Multi-scale vehicle detection for foreground-background class imbalance with improved YOLOv2," Sensors, vol. 19, no. 15, p. 3336, 2019.
[15] A. Taheri Tajar, A. Ramazani, and M. Mansoorizadeh, "A lightweight Tiny-YOLOv3 vehicle detection approach," Journal of Real-Time Image Processing, vol. 18, pp. 2389–2401, 2021.
[16] C.-Y. Wang, A. Bochkovskiy, and H.-Y. M. Liao, "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors," arXiv preprint, 2022.
[17] D. Zhai, X. Zhang, X. Li, X. Xing, Y. Zhou, and C. Ma, "Object detection methods on compressed domain videos: An overview, comparative analysis, and new directions," Measurement, vol. 207, p. 112371, 2023.
[18] L. Zhao, Z. He, W. Cao, and D. Zhao, "Real-time moving object segmentation and classification from HEVC compressed surveillance video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 6, pp. 1346–1357, 2018.
[19] P. H. Chan, A. Huggett, G. Souvalioti, P. Jennings, and V. Donzella, "Influence of AVC and HEVC compression on detection of vehicles through Faster R-CNN," IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 1, pp. 203–213, 2024.
[20] B. Deguerre, C. Chatelain, and G. Gasso, "End-to-end traffic flow rate estimation from MPEG4 part-2 compressed video streams," IEEE Transactions on Intelligent Transportation Systems, vol. 25, no. 8, pp. 8949–8959, 2024.
[21] Q. Cai, Z. Chen, D. O. Wu, S. Liu, and X. Li, "A novel video coding strategy in HEVC for object detection," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 12, pp. 4924–4937, 2021.
[22] L. Chen, H. Sun, J. Katto, X. Zeng, and Y. Fan, "Fast object detection in HEVC intra compressed domain," in 2021 29th European Signal Processing Conference (EUSIPCO), Dublin, Ireland, 2021, pp. 756–760.
[23] M. Alizadeh and M. Sharifkhani, "Compressed domain moving object detection based on CRF," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 674–684, 2020.
[24] J. Feng, S. Li, X. Li, F. Wu, Q. Tian, M. Yang, and H. Ling, "TapLab: A fast framework for semantic video segmentation tapping into compressed-domain knowledge," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1591–1603, 2022.
[25] R. Yang, H. Liu, S. Zhu, X. Zheng, and B. Zeng, "DFCE: Decoder-friendly chrominance enhancement for HEVC intra coding," IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 3, pp. 1481–1486, 2023.
[26] H. Choi and I. V. Bajic, "HEVC intra features for human detection," in 2017 IEEE Global Conference on Signal and Information Processing (GlobalSIP), Montreal, QC, Canada, 2017, pp. 393–397.
[27] Z. Wang, X. Liu, J. Feng, J. Yang, and H. Xi, "Compressed-domain highway vehicle counting by spatial and temporal regression," IEEE Transactions on Circuits and Systems for Video Technology, vol. 29, no. 1, pp. 263–274, 2019.
[28] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, 2003.
[29] G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the High Efficiency Video Coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, 2012.
[30] J. Lainema, F. Bossen, W. J. Han, J. Min, and K. Ugur, "Intra coding of the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1792–1801, 2012.
[31] K. R. Castleman, Digital Image Processing. Prentice Hall Press, 1996.
[32] "Darknet: Open source neural networks in C," https://pjreddie.com/darknet/, accessed: 2023-05-13.
[33] JCT-VC, HEVC reference software, version HM 16.9, https://hevc.hhi.fraunhofer.de/svn/svn_HEVCSoftware/tags/HM-16.9/, n.d.
[34] M. S. Beratoglu and B. U. Toreyin, "Vehicle license plate detection using only block partitioning structure of the High Efficiency Video Coding (HEVC)," in 27th Signal Processing and Communications Applications Conference (SIU), 2019.
[35] M. S. Beratoğlu and B. U. Töreyin, "Vehicle license plate detector in compressed domain," IEEE Access, vol. 9, pp. 95087–95096, 2021.

Behçet Uğur Töreyin received the B.S. degree from the Middle East Technical University, Ankara, Turkey, in 2001, and the M.S. and Ph.D. degrees from Bilkent University, Ankara, in 2003 and 2009, respectively, all in electrical and electronics engineering. He is now an Associate Professor with the Informatics Institute at Istanbul Technical University. His research interests broadly lie in signal processing and pattern recognition with applications to computational intelligence. His research focuses on developing novel algorithms to analyze and compress signals from a multitude of sensors, such as visible/infra-red/hyperspectral cameras, microphones, passive infra-red sensors, vibration sensors, and spectrum sensors for wireless communications.