Counteracting temporal attacks in Video Copy Detection
1 Introduction
Video content has become ubiquitous in the digital age, serving as a primary
medium for entertainment, education, and communication. As video-sharing
platforms grow in popularity, the need for effective copyright protection mech-
anisms becomes increasingly important. Video copy detection systems aim to
identify duplicates or near-duplicates of videos, enabling rights holders to enforce
intellectual property laws and combat unauthorized usage. Aside from copyright
protection, copy detection is an important tool in reducing database redundan-
cies or localizing the original recordings of DeepFake attacks.
The dual-level detection method, as described in [13,16], was recognized as
the winning solution in the prestigious META AI Challenge on video copy detec-
tion [13]. This approach combines Video Editing Detection (VED) with frame-level
descriptor matching.
2 Related work
Video copy detection has been an active area of research for more than a decade,
with early methods focusing on descriptors such as histograms and block-based
matching techniques. As computational power and storage capabilities improved,
researchers shifted towards feature-based approaches, utilizing local descriptors
like Scale-Invariant Feature Transform (SIFT) and Speeded-Up Robust Features
(SURF). These methods enabled more accurate matching of video fragments,
even under transformations such as scaling, cropping, and rotation [17,10].
With the advent of deep learning, video copy detection has undergone a
paradigm shift. Convolutional Neural Networks (CNNs) like VGG [15], ResNet [7],
and models such as ViSiL [9] have significantly advanced frame-level feature ex-
traction for video similarity learning. Additionally, architectures like 3D-CNNs [11]
and encoder–decoder ConvLSTM models [3] capture spatiotemporal dynam-
ics, improving robustness to temporal edits. Advanced methods now incorpo-
rate transformer-based architectures, such as Video ViT [1], and self-supervised
learning frameworks like 3D-CSL [4] to further enhance the system’s ability to
detect heavily edited and transformed videos [4,5,2]. Lightweight approaches,
including multi-teacher distillation frameworks [12] and compact Siamese neu-
ral networks [6], have recently gained attention for achieving high efficiency and
scalability on large datasets.
The importance of dynamic frame selection based on interframe differences
for creating highly compact video representations was shown in [6].
On the other hand, [18] proposes a CNN-LSTM hybrid model for human
action recognition, evaluated on UCF101 and HMDB51 datasets. Their method
improves accuracy by up to 3.5% over CNN-only baselines but suffers from high
computational complexity and sensitivity to viewpoint variations, making the
solution unusable in real-world applications. The authors of [8] also indicate
time and memory limitations and propose Relational Self-supervised Distillation
with Compact Descriptors (RDCD) for image copy detection, using a lightweight
network and compact descriptors to improve efficiency. Their method employs
relational self-supervised distillation to transfer knowledge from a large teacher
network (ResNet-50) to a smaller student network (EfficientNet-B0) and intro-
duces contrastive learning with a hard negative loss to mitigate dimensional
collapse. Evaluated on the DISC2021, Copydays (CD10K), and NDEC datasets,
RDCD improves micro average precision (µAP) over the baseline by about 5.0%,
depending on the descriptor size, while maintaining competitive performance
with significantly smaller descriptors. However, the reliance on knowledge
distillation from a large teacher network and the additional computational cost
during training remain limitations. Despite these advances, challenges persist. Many systems
struggle with efficiency when processing large-scale datasets, as feature extrac-
tion and matching remain computationally expensive. For example, as demon-
strated in [6], widely used models like ViSiL generate descriptors of 2025kB for
a typical 30-second video, with an inference speed of approximately 6.5 sam-
ples per second (sps), where one sample corresponds to a 30-second video. In
contrast, the method proposed in [6] achieves a descriptor size of just 1.875kB
while delivering an inference speed of up to 178.6 sps with satisfactory accuracy.
Similarly, the authors of [4] highlight that prevalent models such as VRL-F and
TCA-F (both frame-based matching) require over 10 seconds to generate a sin-
gle descriptor for a sample from the FIVR-200K dataset, with descriptor sizes
spanning several megabytes. ViSiL, by comparison, is even less efficient, taking
about 10 times longer and producing descriptors approximately 10 times larger.
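To put these figures into perspective with a back-of-the-envelope calculation (assuming a
hypothetical corpus of one million 30-second videos, a figure of our choosing rather than one
taken from the cited works): 10^6 × 2025 kB ≈ 2.0 TB of ViSiL descriptors versus
10^6 × 1.875 kB ≈ 1.9 GB for the compact representation of [6], a roughly 1080-fold
difference in storage.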
The META AI Challenge on video copy detection, held in 2023, provided a rigor-
ous platform for evaluating state-of-the-art methods in video similarity analysis.
The challenge was designed to push the boundaries of video copy detection and
localization, attracting top research teams from around the world. It included
two tracks: video copy detection (VCD) and video copy localization (VCL). The
VCD track focused on identifying whether two videos shared copied content,
while the VCL track required participants to localize the temporal segments of
shared content within video pairs. In this paper, we focus on the VCD track.
The participants were constrained by computational resource limits to en-
sure solutions were scalable to real-world scenarios. The competition empha-
sized practical applications, such as content moderation, copyright protection,
and misinformation detection, highlighting the importance of efficient and accu-
rate solutions in this domain. The challenge introduced a benchmark dataset,
DVSC2023, and a strong baseline model to facilitate evaluation and comparison
of methods. Although the official competition has ended, the Meta AI Video Simi-
larity Challenge and DVSC2023 are still available in the form of an Open Arena
(see footnotes 1 and 2).
3.1 Dataset
The DVSC2023 dataset was created using videos from the YFCC100M collec-
tion, filtered to ensure Creative Commons licenses and to exclude videos that
were too short or low-resolution. It contains a mixture of reference and query
videos, with query videos being transformed versions of reference videos or dis-
tractors. The videos were modified using various augmentations, such as spatial
updates (cropping, resizing), temporal edits (speed changes, frame alternation),
and complex transformations like screen capture simulation. These transforma-
tions were applied to create challenging scenarios for detecting copies. Examples
of video frames copied with applied transformations are presented in Fig. 1. The
data is partitioned into a Training Split containing 8,404 query videos and
40,311 reference videos, with 2,708 queries containing copied segments; a Valida-
tion Split consisting of 8,295 query videos and 40,318 reference videos, with
2,641 queries containing copied segments; and a Test Split of 8,015 query videos
evaluated against 2,519 reference videos, with 1,840 queries containing copied
segments. Additionally, 6,475, 6,369, and 6,175 distractor queries are included in
the training, validation, and test splits, respectively, ensuring that most queries
contain no copied segments, which replicates real-world conditions.
The contestants received only the training and validation splits before submit-
ting their methods. Training data came with match results for all query-reference
pairs, while validation match results were hidden and known only to the orga-
nizers. After submission, participants received feedback on validation split per-
formance based on the ground truth. In Phase 2 of the challenge, the unseen
test data was used for the final evaluation and leaderboard ranking, ensuring
fair comparison across methods.
1 https://www.drivendata.org/competitions/219/competition-meta-vsc-desc-open/
2 https://www.drivendata.org/competitions/220/competition-meta-vsc-match-open/
The evaluation metric for the VCD track is micro average precision (µAP), computed over the
ranked prediction lists of all queries based on their confidence scores. Precision
and recall are calculated at each rank, with a pair counted as positive when the
query-reference prediction corresponds to a ground-truth match involving copied segments.
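For concreteness, a minimal sketch of such a µAP computation is given below; it assumes
predictions come as (query, reference, score) triples and that gt is the set of ground-truth
copied pairs, and it mirrors the description above rather than the official challenge scorer.

```python
# Sketch of micro average precision over a pooled, ranked prediction list.
def micro_average_precision(predictions, gt):
    ranked = sorted(predictions, key=lambda t: t[2], reverse=True)
    tp, ap = 0, 0.0
    for rank, (q, r, _score) in enumerate(ranked, start=1):
        if (q, r) in gt:          # positive: ground-truth copied pair
            tp += 1
            ap += tp / rank       # precision at this recall point
    return ap / len(gt)           # recall denominator: all true pairs


if __name__ == "__main__":
    preds = [("q1", "r1", 0.9), ("q1", "r2", 0.4), ("q2", "r3", 0.7)]
    truth = {("q1", "r1"), ("q2", "r3")}
    print(micro_average_precision(preds, truth))  # 1.0 for this toy example
```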
It is worth noting that the DVSC2023 dataset for the challenge does not include
queries with exact duplicate fragments (unedited) of reference videos. However,
it contains video pairs that are largely similar but exhibit subtle differences,
distinguishing them as distinct videos rather than identical copies, e.g., videos
may share the same background but feature different people.
The dual-level method uses VED to identify unedited videos, assigning ran-
dom descriptors with small norms and negative bias terms to those that are
unedited. Hence, it is able to handle these challenging video pairs, reporting a
5% improvement in matching accuracy. However, when tested under real-world
conditions, the method reveals a significant limitation. Our experiments show
that VED consistently misclassifies exact video copies from the reference set as
non-copies. We tested the dual-level method on 100 queries that were exact
copies of 100 chosen reference videos; none of them was recognized as a copy.
This highlights a fundamental flaw in the current implementation of VED, mak-
ing it unsuitable for practical applications.
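For illustration, a minimal sketch of such a gating step is shown below; ved_is_edited,
extract_descriptor, the descriptor dimension, and the norm value are hypothetical placeholders
rather than the actual implementation of [16].

```python
# Hedged sketch of the VED gating described above.
import numpy as np


def gated_descriptor(video, ved_is_edited, extract_descriptor,
                     dim=512, small_norm=1e-3, seed=0):
    """Return a matching descriptor only for videos VED flags as edited."""
    if ved_is_edited(video):
        return extract_descriptor(video)
    # "Unedited" branch: a random direction scaled to a tiny norm, so the
    # video can essentially never score as a copy -- which is exactly why
    # exact duplicates are lost when they land in this branch.
    rng = np.random.default_rng(seed)
    v = rng.normal(size=dim)
    return small_norm * v / np.linalg.norm(v)
```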
Additionally, the authors deterministically extract one frame from each sec-
ond of the video, a common approach also used in the challenge organizers'
baseline model. Although straightforward, this approach is vulnerable to
targeted temporal attacks, as discussed in Sect. 6.
5 Proposed Method
We consider two frame selection strategies based on the interframe differences
curve smoothed with a Hanning window: the first selects frames located at the local
maxima of the curve, while the second selects the middle frames between consecutive
local maxima (cf. Tab. 1). The motivation for the second strategy is that the first,
while more time efficient, targets the exact moments of scene change, making it
vulnerable to temporal attacks (e.g., insertion of random frames). In contrast, the
second strategy may offer better resistance, yet requires more time to identify the
frame. Using either method, we reduced the number of frames by 40 to over 150 times,
which is 1.4 to 5.8 times more efficient than the standard 1 fps approach used in [16].
For a sample video
with an fps of 24 and a total of 719 frames, we select 17 frames when using a
Hanning window of size 30 (Fig. 2), 13 frames with a Hanning window of size
50 (Fig. 3), and only 5 frames with a Hanning window of size 100 (Fig. 4). This
corresponds to a reduction in the number of frames for a video by factors of
approximately 42, 55, and 144, respectively. In contrast, a simple one-frame-per-
second approach reduces the frames by a factor of only 24. However, it should be
noted that smoothing with larger Hanning windows may result in the omission
of frames from shorter scenes, as the larger the window, the greater the reduction
in frames, which can also be observed in Fig. 5, where we present the frames
selected from the first 10 seconds of the sample video by the different methods.
This phenomenon is investigated in our experiments (cf. Tab. 1).
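To make the two strategies concrete, the following minimal sketch computes the smoothed
interframe-differences curve and returns the selected frame indices. It assumes the video is
already decoded into a list of grayscale frames; the function names, the toy input, and the
simple neighbour-based maxima test are our own illustrative choices, not the exact implementation.

```python
import numpy as np


def interframe_differences(frames):
    """Mean absolute difference between consecutive frames."""
    return np.array([
        np.abs(frames[i + 1].astype(np.float32) - frames[i].astype(np.float32)).mean()
        for i in range(len(frames) - 1)
    ])


def select_frames(frames, window=30, midpoints=False):
    """Select frame indices from the smoothed interframe-differences curve.

    midpoints=False -> local maxima of the curve (first strategy);
    midpoints=True  -> middle frames between consecutive maxima (second strategy).
    """
    diffs = interframe_differences(frames)
    win = np.hanning(window)
    smoothed = np.convolve(diffs, win / win.sum(), mode="same")
    # Local maxima: points higher than both neighbours.
    maxima = [i for i in range(1, len(smoothed) - 1)
              if smoothed[i - 1] < smoothed[i] >= smoothed[i + 1]]
    if not midpoints:
        return maxima
    return [(a + b) // 2 for a, b in zip(maxima[:-1], maxima[1:])]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy = [rng.integers(0, 256, (120, 160), dtype=np.uint8) for _ in range(300)]
    print(select_frames(toy, window=30, midpoints=True))
```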
Fig. 2: Interframe differences curve before and after smoothing with Hanning
window of size 30, and selected frames of a sample video.
Fig. 3: Interframe differences curve before and after smoothing with Hanning
window of size 50, and selected frames of a sample video.
Fig. 4: Interframe differences curve before and after smoothing with Hanning
window of size 100, and selected frames of a sample video.
Fig. 5: Selected frames from the first 10 seconds of a sample video obtained using
different experimental methods.
5.2 Temporal Attacks
Random Frame Blackouts: Frames of the original video were blacked out with
probabilities p = 1/25 and p = 1/10. Naturally, the inserted black frames modify
related frames during video compression, affecting the corresponding P- and
B-frames, so after decoding more than one frame is altered. The number of
affected frames depends on which frame is blacked out, as it influences the
I-, P-, and B-frame selection (a combined sketch of all three attacks is given
after this list).
Targeted Frame Blackouts: In this attack, the middle frame of each second
of the video was blacked out in the same manner as in the first attack. Note
that such precise attacks are visually imperceptible to users, making them
a significant threat to video copy detection systems.
Speed Modification: We used ffmpeg to modify the tempo at which the video is
played. In technical terms, it keeps all the frames but saves them as a video
with a changed fps, so that we obtain the effect of acceleration or deceleration.
Naturally, this may influence the I-, P-, and B-frame selection.
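The sketch below reproduces the three attacks under stated assumptions: OpenCV (cv2) and an
ffmpeg binary are available, and the file names, codec, and output frame rate are illustrative
choices rather than the exact experimental pipeline.

```python
import random
import subprocess

import cv2
import numpy as np


def blackout_attack(src, dst, p=None, target_middle=False):
    """Black out frames: randomly with probability p, or the middle frame
    of every second when target_middle is True."""
    cap = cv2.VideoCapture(src)
    fps = cap.get(cv2.CAP_PROP_FPS)
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    out = cv2.VideoWriter(dst, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hit = (p is not None and random.random() < p) or \
              (target_middle and idx % round(fps) == round(fps) // 2)
        out.write(np.zeros_like(frame) if hit else frame)
        idx += 1
    cap.release()
    out.release()


def speed_attack(src, dst, factor=1.5, out_fps=None):
    """Re-time the video with ffmpeg's setpts filter (audio dropped).
    Without an explicit output rate ffmpeg may drop or duplicate frames to keep
    a constant frame rate, so out_fps = source_fps * factor preserves them."""
    cmd = ["ffmpeg", "-y", "-i", src, "-an", "-filter:v", f"setpts=PTS/{factor}"]
    if out_fps is not None:
        cmd += ["-r", str(out_fps)]
    subprocess.run(cmd + [dst], check=True)


if __name__ == "__main__":
    blackout_attack("query.mp4", "rand_attack.mp4", p=1 / 25)
    blackout_attack("query.mp4", "targeted_attack.mp4", target_middle=True)
    speed_attack("query.mp4", "speed_attack.mp4", factor=1.5, out_fps=36)
```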
6 Experimental Results
The winning team extracted a smaller validation set from the original training
dataset for their experiments, reducing its size by a factor of four (1681 queries).
We conduct our experiments using the same data.
In order to analyze the efficacy of each frame selection method, as well as the
influence of the smoothing window size, we measured µAP in a scenario reflecting the
challenge. Moreover, we analyze the influence of the temporal attacks described
in Sect. 5.2. The results are shown in Tab. 1. One can note that the highest
efficacy is obtained by the second selection strategy with a Hanning window of
size 30, with the same strategy and window size 50 being a close second. Moreover,
one can observe that the temporal attacks do have an impact on the matching
efficacy; in particular, the targeted attack affects the correct detection rate. As
the first row of the table shows, an incorrectly selected frame representation can
reduce the efficacy by up to 60%. Finally, one can note that the second selection
strategy is more stable, resulting in a significantly lower standard deviation when
compared with the first strategy at the same window size.
6.2 Efficiency
Due to the ubiquitous nature of videos, the VCD system used in copyright man-
agement or DeepFake detection should be time efficient.
Table 1: Matching results [µAP] for the proposed frame extraction methods depending
on the Hanning window size and the applied temporal attack.
Local-max-windowX – local maxima of the interframe differences curve, smoothed
with a Hanning window of size X;
Local-max-mid-windowX – middle frames between local maxima of the interframe
differences curve, smoothed with a Hanning window of size X. Each random
attack is averaged over 3 independent runs; the standard deviation is given in
parentheses.
Method                  | No attack | Random attack (p = 1/25) | Random attack (p = 1/10) | Targeted attack (middle frame/s)
Local-max-window30      | 0.9300    | 0.8086 (0.0184)          | 0.8012 (0.0167)          | 0.3705
Local-max-mid-window30  | 0.9300    | 0.8807 (0.0034)          | 0.8747 (0.0021)          | 0.8835
Local-max-window50      | 0.9252    | 0.8361 (0.0068)          | 0.8085 (0.0070)          | 0.6891
Local-max-mid-window50  | 0.9261    | 0.8549 (0.0013)          | 0.8555 (0.0014)          | 0.8577
Local-max-window100     | 0.8273    | 0.7531 (0.0095)          | 0.7564 (0.0106)          | 0.8138
Local-max-mid-window100 | 0.8343    | 0.7759 (0.0050)          | 0.7667 (0.0016)          | 0.8004
Additionally, the more videos we want to track, the more the storage of their
representations becomes an issue; hence, a compact video representation is required.
As shown in Table 2, our frame extraction method with a smoothing window of size
50 reduces inference time by a factor of 2, allowing analysis of over 3 video samples
per second, compared to 1.58 samples per second in [16]. Moreover, the representation
of each video is reduced by almost 56%. The speedup and the reduction in representation
size come with a minor performance trade-off, lowering the efficacy by less than
1% µAP. One can note that reducing the size of the smoothing window results in
higher efficacy (a 4‰ increase compared to window size 50; 4‰ worse than
Dual-level), yet significantly reduces the time and memory gains.
Table 2: Inference performance, total size of descriptors and matching results for
original and proposed methods.
Method                 | Inference performance [vid/s] | Total descriptors size [MB] | Match result [µAP]
Dual-level [16]        | 1.58                          | 57.09                       | 0.9343
Local-max-mid-window30 | 2.39                          | 40.71                       | 0.9300
Local-max-mid-window50 | 3.27                          | 25.16                       | 0.9261
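For reference, the quoted gains follow directly from Table 2: 3.27/1.58 ≈ 2.07 for the
inference speedup, (57.09 − 25.16)/57.09 ≈ 0.559 (an almost 56% reduction) for the total
descriptor size, and 0.9343 − 0.9261 = 0.0082, i.e., less than 1% µAP, for the efficacy
trade-off.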
Table 3 presents the results under both Random and Targeted temporal
attacks, showing the vulnerability of Dual-level detection. In particular, the
Targeted attack causes over a 60% drop in µAP in comparison with the scenario
without temporal attacks. On the other hand, our approach suffers around a 5%
loss from the attacks in the case of window size 30, and 7% in the case of size 50.
These results show the robustness of the frame selection technique against temporal
attacks, as well as the severity of the attack despite its limited scope, with only
one frame per second being modified. Moreover, for random attacks, one can observe
a lower standard deviation for the proposed method, indicating higher stability,
with the larger window resulting in lower variance. Note that the efficacy difference
between the two window sizes is lower than 3% for all attack variants, which makes
the larger window a reasonable choice when time and memory restrictions are severe.
Another typical temporal-domain attack used to hinder copy detection is
modification of the video speed. The results of such tests are shown in Tab. 4. We
examined two acceleration factors (1.2 and 1.5) and a slowdown by a factor of 2.
One can note that the proposed frame extraction method is invariant to the
acceleration factor, whereas the Dual-level approach loses efficacy, the more so
the faster the video is played. This is due to its deterministic frame selection,
which limits adaptability to time compression or extension, resulting in an almost
7% drop and allowing our approach to outperform it by a similar margin.
Table 3: Matching results [µAP] for the Dual-level and proposed methods depending
on the applied temporal attack (frame blackouts). For random attacks the result
is averaged over 3 attempts; the standard deviation is given in parentheses.
Method                 | No attack | Random attack (p = 1/25) | Random attack (p = 1/10) | Targeted attack (middle frame/s)
Dual-level [16]        | 0.9343    | 0.8788 (0.0066)          | 0.8674 (0.0072)          | 0.3705
Local-max-mid-window30 | 0.9300    | 0.8807 (0.0034)          | 0.8747 (0.0021)          | 0.8835
Local-max-mid-window50 | 0.9261    | 0.8549 (0.0013)          | 0.8555 (0.0014)          | 0.8577
Table 4: Matching results [µAP] for the Dual-level and proposed methods under
the Speed Modification attack.
Method / Acceleration factor | 1.0 (no attack) | 1.2    | 1.5    | 0.5
Dual-level [16]              | 0.9343          | 0.9052 | 0.8709 | 0.9284
Local-max-mid-window30       | 0.9300          | 0.9300 | 0.9300 | 0.9300
Local-max-mid-window50       | 0.9261          | 0.9261 | 0.9261 | 0.9261
7 Conclusion
This paper focuses on the problem of Video Copy Detection and addresses the
limitations of the Dual-level detection method [16] that was successful in the Meta
AI Challenge. We propose an improved method of video representation based on more
adaptive frame selection. Additionally, we analyze the performance of the de-
tection method against three proposed temporal attacks. Using local maxima of
interframe differences, the proposed method reduces computational costs while
keeping comparable efficacy measured as micro average precision. The perfor-
mance difference on the DVSC2023 dataset is less than 1%, while the resilience
against temporal attacks increases. Our experiments show that our method re-
duces the number of frames required to properly represent a video by 1.4 to 5.8
times compared to the standard one-frame-per-second approach. Moreover,
in the performed tests, our method achieved 2 times faster inference, which is
important in real-world applications when processing massive video databases.
Furthermore, the efficacy of the proposed approach proved to be invariant to video
speed manipulations, whereas the previous method suffered a 7% µAP drop. Similar
results were obtained for frame blackouts, both random and targeted, showing
the resistance of our method.
Future research will focus on developing adaptive temporal alignment tech-
niques to enhance robustness against significant frame modifications (like black-
outs), as well as integrating feature matching techniques that allow reducing the
size of the representation for even better performance. Additionally,
expanding the approach to handle localized transformations and cross-modal
attacks (e.g., temporal modifications, room attack, and overlays as shown in
Fig. 1) will further strengthen its effectiveness. Another branch of research is
further improvement of inference time, so that large-scale video databases may
be checked for copies. Finally, VCD models should be investigated for their
interpretability, so that they follow the XAI research trend and comply with
juridical restrictions for real-world legal usage.
References
1. Arnab, A., Dehghani, M., Heigold, G., Sun, C., Lučić, M., Schmid, C.: Vivit: A
video vision transformer. In: Proceedings of the IEEE/CVF international confer-
ence on computer vision. pp. 6836–6846 (2021)
2. Black, A., Jenni, S., Bui, T., Tanjim, M.M., Petrangeli, S., Sinha, R., Swami-
nathan, V., Collomosse, J.: Vader: Video alignment differencing and retrieval. In:
Proceedings of the IEEE/CVF International Conference on Computer Vision. pp.
22357–22367 (2023)
3. Chiang, T.H., Tseng, Y.C., Tseng, Y.C.: A multi-embedding neural model for in-
cident video retrieval. Pattern Recognition 130, 108807 (2022)
4. Deng, R., Wu, Q., Li, Y.: 3d-csl: self-supervised 3d context similarity learning for
near-duplicate video retrieval. In: 2023 IEEE International Conference on Image
Processing (ICIP). pp. 2880–2884. IEEE (2023)
5. Deng, R., Wu, Q., Li, Y., Fu, H.: Differentiable resolution compression and align-
ment for efficient video classification and retrieval. In: ICASSP 2024-2024 IEEE
International Conference on Acoustics, Speech and Signal Processing (ICASSP).
pp. 3200–3204. IEEE (2024)
6. Fojcik, K., Syga, P., Klonowski, M.: Extremely compact video representation for
efficient near-duplicates detection. Pattern Recognition 158, 111016 (2025)
7. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In:
Proceedings of the IEEE conference on computer vision and pattern recognition.
pp. 770–778 (2016)
8. Kim, J., Woo, S., Nang, J.: Relational self-supervised distillation with compact
descriptors for image copy detection (2024), https://arxiv.org/abs/2405.17928
9. Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., Kompatsiaris, I.: Visil:
Fine-grained spatio-temporal video similarity learning. In: Proceedings of the
IEEE/CVF international conference on computer vision. pp. 6351–6360 (2019)
10. Kordopatis-Zilos, G., Papadopoulos, S., Patras, I., Kompatsiaris, Y.: Near-
duplicate video retrieval with deep metric learning. In: Proceedings of the IEEE
international conference on computer vision workshops. pp. 347–356 (2017)
11. Li, J., Zhang, H., Wan, W., Sun, J.: Two-class 3d-cnn classifiers combination
for video copy detection. Multimedia Tools and Applications 79(7-8), 4749–4761
(2020)
12. Ma, Z., Dong, J., Ji, S., Liu, Z., Zhang, X., Wang, Z., He, S., Qian, F., Zhang, X.,
Yang, L.: Let all be whitened: Multi-teacher distillation for efficient visual retrieval.
In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 38, pp. 4126–
4135 (2024)
13. Pizzi, E., Kordopatis-Zilos, G., Patel, H., Postelnicu, G., Ravindra, S.N., Gupta,
A., Papadopoulos, S., Tolias, G., Douze, M.: The 2023 video similarity dataset and
challenge. Computer Vision and Image Understanding 243, 103997 (2024)
14. Pizzi, E., Roy, S.D., Ravindra, S.N., Goyal, P., Douze, M.: A self-supervised de-
scriptor for image copy detection. Proc. CVPR (2022)
15. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale
image recognition. In: Bengio, Y., LeCun, Y. (eds.) 3rd International Conference
on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015,
Conference Track Proceedings (2015)
16. Wang, T., Ma, F., Liu, Z., Rao, F.: A dual-level detection method for video copy
detection. arXiv preprint arXiv:2305.12361 (2023)
17. Wu, X., Hauptmann, A.G., Ngo, C.W.: Practical elimination of near-duplicates
from web video search. In: Proceedings of the 15th ACM international conference
on Multimedia. pp. 218–227 (2007)
18. Zhong, J.L., Gan, Y.F., Yang, J.X.: Efficient detection of intra/inter-frame video
copy-move forgery: A hierarchical coarse-to-fine method. Journal of Information
Security and Applications 85, 103863 (2024). https://doi.org/10.1016/j.jisa.2024.103863