
Volume 9, Issue 9, September 2024 International Journal of Innovative Science and Research Technology
ISSN No: 2456-2165 https://doi.org/10.38124/ijisrt/IJISRT24SEP1105

An Efficient Transformer-Based System for Text-Based Video Segment Retrieval Using FAISS
Sai Vivek Reddy Gurram
Independent Researcher

Abstract:- An efficient system for text-based video segment retrieval is presented, leveraging transformer-based embeddings and the FAISS library for similarity search. The system enables users to perform real-time, scalable searches over video datasets by converting video segments into combined text and image embeddings. Key components include video segmentation, speech-to-text transcription using Wav2Vec 2.0, frame extraction, embedding generation using Vision Transformers and Sentence Transformers, and efficient similarity search using FAISS. Experimental results demonstrate the system's applicability in media archives, education, and content discovery, even when applied to a small dataset.

I. INTRODUCTION

The exponential growth of video content across various platforms necessitates efficient methods for indexing and retrieving relevant segments based on textual queries. Traditional methods often rely on metadata or manual annotations, which are neither scalable nor efficient. Recent advances in transformer-based models for natural language processing and computer vision offer new avenues for automating this process.

In this paper, an integrated system is proposed that utilizes state-of-the-art transformer models to generate embeddings for both the visual and textual content of video segments. By indexing these embeddings using Facebook AI Similarity Search (FAISS), fast and accurate retrieval of video segments in response to textual queries is enabled. The contributions of this work are:
 A comprehensive pipeline for video segmentation, transcription, frame extraction, and embedding generation.
 An efficient method for combining text and image embeddings to create a rich representation of video segments.
 The application of FAISS for scalable and real-time similarity search over video datasets.

II. RELATED WORK

 Video Retrieval Systems
Previous work in video retrieval has focused on keyword-based search and content-based retrieval using low-level features like color histograms and motion vectors [1]. These methods often lack the semantic understanding required for accurate retrieval.

 Transformer Models in NLP and CV
Transformers have revolutionized NLP and computer vision tasks. Models like BERT [2] and Vision Transformers [3] have achieved state-of-the-art results in various applications. Wav2Vec 2.0 [4] has shown significant improvements in speech recognition tasks, making it suitable for automatic transcription.

 Similarity Search with FAISS
FAISS [5] is a library for efficient similarity search and clustering of dense vectors. It has been used extensively for large-scale information retrieval tasks due to its speed and scalability.

III. METHODOLOGY

The system comprises several components that work in tandem to enable efficient video segment retrieval.

 Video Segmentation
Videos are divided into fixed-length segments to create manageable units for processing and retrieval; a minimal sketch follows the list below.

 Video Loading: Videos are loaded using MoviePy [6], providing access to metadata and content.
 Duration Calculation: The total duration is computed to determine the number of segments.
 Segmentation Loop: The video is iterated over, extracting 30-second clips.
 Subclipping: Each segment is saved as an individual MP4 file using standard codecs.
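
As a concrete illustration, the following is a minimal sketch of the segmentation step, assuming MoviePy 1.x; the segment_video helper and output naming are illustrative, not taken from the original implementation.

```python
from moviepy.editor import VideoFileClip

def segment_video(path, segment_length=30, out_prefix="segment"):
    """Split a video into fixed-length clips and save each as an MP4 file."""
    video = VideoFileClip(path)                      # load video and metadata
    duration = int(video.duration)                   # total duration in seconds
    segment_paths = []
    for i, start in enumerate(range(0, duration, segment_length)):
        end = min(start + segment_length, duration)  # final clip may be shorter
        clip = video.subclip(start, end)             # cut out one 30-second window
        out_path = f"{out_prefix}_{i:03d}.mp4"
        clip.write_videofile(out_path, codec="libx264", audio_codec="aac")
        segment_paths.append(out_path)
    return segment_paths
```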

 Speech-to-Text Transcription
The audio from each video segment is converted into text transcripts, as sketched after the list below.

 Model Used: Wav2Vec 2.0 ("facebook/wav2vec2-large-960h") [4].
 Processor: Wav2Vec2Processor handles audio pre-processing and decoding.
 Audio Loading: Librosa [7] loads and resamples audio to 16 kHz, as required by the model.
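
A minimal transcription sketch, assuming the Hugging Face Transformers checkpoint named above and simple greedy (argmax) CTC decoding:

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

def transcribe(audio_path):
    """Transcribe one segment's audio track to text."""
    speech, _ = librosa.load(audio_path, sr=16000)   # load and resample to 16 kHz
    inputs = processor(speech, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits   # CTC logits per time step
    predicted_ids = torch.argmax(logits, dim=-1)     # greedy decoding
    return processor.batch_decode(predicted_ids)[0]
```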

 Frame Extraction
Key frames are extracted from each video segment to capture visual information, as sketched after the list below.

 Frame Sampling: Six equally spaced frames per segment are extracted.
 Timestamp Calculation: Using np.linspace, timestamps are generated excluding the start and end to avoid redundancy.
 Frame Extraction: Frames are captured using MoviePy's get_frame method.
 Saving Frames: Extracted frames are saved as JPEG images using Pillow [8].
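
A sketch of frame sampling under the stated parameters (six interior timestamps, endpoints excluded); the output naming is illustrative.

```python
import numpy as np
from PIL import Image
from moviepy.editor import VideoFileClip

def extract_frames(segment_path, n_frames=6):
    """Sample equally spaced frames from a segment and save them as JPEGs."""
    clip = VideoFileClip(segment_path)
    # n_frames interior timestamps: np.linspace with the two endpoints sliced off
    times = np.linspace(0, clip.duration, n_frames + 2)[1:-1]
    frame_paths = []
    for i, t in enumerate(times):
        frame = clip.get_frame(t)                    # H x W x 3 uint8 array
        out_path = f"{segment_path}_frame{i}.jpg"
        Image.fromarray(frame).save(out_path, "JPEG")
        frame_paths.append(out_path)
    return frame_paths
```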

 Image Embedding Generation
Embeddings for each extracted frame are generated using a Vision Transformer.

 Model Used: ViT-B/16 pre-trained on ImageNet [3].
 Preprocessing: Frames are resized and normalized.
 Embedding Extraction: Each frame is passed through the ViT model to obtain a 768-dimensional embedding.
 Aggregation: Embeddings from all frames in a segment are averaged to form a single image embedding, as sketched below.
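
A sketch of the frame-embedding and averaging step. The checkpoint google/vit-base-patch16-224 and the use of the [CLS] token as the per-frame representation are assumptions consistent with the description above, not confirmed details of the original implementation.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTModel

vit_processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
vit_model = ViTModel.from_pretrained("google/vit-base-patch16-224")

def image_embedding(frame_paths):
    """Embed each frame with ViT-B/16 and average into one 768-dim vector."""
    images = [Image.open(p).convert("RGB") for p in frame_paths]
    inputs = vit_processor(images=images, return_tensors="pt")  # resize + normalize
    with torch.no_grad():
        outputs = vit_model(**inputs)
    cls_vectors = outputs.last_hidden_state[:, 0, :]  # [CLS] embedding per frame
    return cls_vectors.mean(dim=0).numpy()            # average over frames -> (768,)
```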
 Text Embedding Generation
Embeddings for the transcripts of each video segment are generated.

 Model Used: SentenceTransformer's "all-MiniLM-L6-v2" [9].
 Embedding Extraction: Transcripts are converted into embeddings of size 384, as sketched below.
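
This step is a direct use of the SentenceTransformers library:

```python
from sentence_transformers import SentenceTransformer

text_model = SentenceTransformer("all-MiniLM-L6-v2")

def text_embedding(transcript):
    """Encode a transcript into a 384-dimensional vector."""
    return text_model.encode(transcript)             # numpy array of shape (384,)
```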
 Combining Embeddings
To create a multimodal representation, text and image embeddings are combined, as sketched after the list below.

 Concatenation: The 384-dimensional text embedding and the 768-dimensional image embedding are concatenated to form a 1,152-dimensional vector.
 Alignment: Embeddings are ensured to correspond to the correct video segments.
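
A minimal sketch of the combination step; keeping each combined vector aligned with its segment file name is assumed to be handled by the caller.

```python
import numpy as np

def combined_embedding(text_vec, image_vec):
    """Concatenate a 384-dim text vector and a 768-dim image vector."""
    assert text_vec.shape == (384,) and image_vec.shape == (768,)
    return np.concatenate([text_vec, image_vec])     # shape (1152,)
```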
 Indexing with FAISS
The combined embeddings are indexed using FAISS for efficient similarity search, as sketched after the list below.

 Normalization: Embeddings are L2-normalized.
 Index Type: IndexFlatIP is used for inner product (cosine similarity) search.
 Indexing: All embeddings are added to the FAISS index.
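
A sketch of index construction; because the vectors are L2-normalized, the inner product computed by IndexFlatIP equals cosine similarity.

```python
import faiss
import numpy as np

def build_index(embeddings):
    """Index L2-normalized combined embeddings for cosine-similarity search."""
    matrix = np.stack(embeddings).astype("float32")  # (n_segments, 1152)
    faiss.normalize_L2(matrix)                       # in-place L2 normalization
    index = faiss.IndexFlatIP(matrix.shape[1])       # inner product == cosine here
    index.add(matrix)
    return index
```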
 Similarity Search
The retrieval process involves querying the FAISS index with a user-provided text query, as sketched after the list below.

 Query Embedding: The query is embedded using the same Sentence Transformer model.
 Normalization: The query embedding is L2-normalized and zero-padded to match the combined embedding size.
 Similarity Search: FAISS retrieves the top-k most similar video segments based on cosine similarity.
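
A sketch of the query path, reusing text_model from the text-embedding sketch above. Note that zero-padding the 384-dim query to the 1,152-dim index width means only the text portion of each combined embedding contributes to the score.

```python
import numpy as np

def search(index, query, k=5, dim=1152):
    """Retrieve the top-k most similar segments for a text query."""
    q = text_model.encode(query).astype("float32")   # 384-dim query embedding
    q = q / np.linalg.norm(q)                        # L2-normalize
    q = np.pad(q, (0, dim - q.shape[0]))             # zero-pad to index width
    scores, ids = index.search(q.reshape(1, -1), k)  # inner-product search
    return list(zip(ids[0].tolist(), scores[0].tolist()))
```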

 Experiments and Results

 Dataset
The system was evaluated on a dataset comprising 10 minutes of video content, consisting of a single video clip segmented into multiple parts. Although the dataset is small, it serves as a proof of concept for the system's functionality.

 Evaluation Metrics

 Functionality Verification: Ensuring each component of the system works as intended.
 Retrieval Examples: Demonstrating retrieval results for sample queries.
 Performance Metrics: Measuring processing time for indexing and retrieval.

IV. RESULTS

 Functionality: All components, including video segmentation, transcription, frame extraction, embedding generation, and similarity search, functioned correctly.

 Retrieval Examples: Sample queries returned relevant video segments, indicating that the system can effectively match textual queries to video content.

 Processing Time:

 Indexing Time: The small dataset allowed for rapid indexing, taking approximately 2 minutes.
 Query Response Time: Average query response time was under 0.1 seconds, demonstrating real-time performance.

V. DISCUSSION

Due to the limited size of the dataset, quantitative metrics like Mean Average Precision (mAP) are less meaningful. However, the successful implementation and operation of the system on this dataset validate the approach. The system can be scaled to larger datasets for more comprehensive evaluations in future work.

VI. CONCLUSION

An efficient system for text-based video segment retrieval was introduced, leveraging transformer-based embeddings and FAISS for similarity search. By combining text and image embeddings, a richer representation of video content is captured, leading to effective retrieval performance. The system operates in real time and demonstrates potential applicability in media archives, education, and content discovery. Future work includes scaling the system to larger datasets and conducting extensive evaluations.

REFERENCES

[1]. C. G. M. Snoek, M. Worring, and A. W. M. Smeulders. Early versus late fusion in semantic video analysis. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 399–402, Singapore, 2005.
[2]. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4171–4186, Minneapolis, MN, 2019.
[3]. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[4]. Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460, 2020.
[5]. Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 7(3):535–547, 2019.
[6]. Zulko. MoviePy: Video editing with Python, 2015. Zenodo.
[7]. Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, pages 18–25, Austin, TX, 2015.
[8]. Alex Clark. Pillow (PIL Fork) Documentation, 2015. Python Imaging Library (PIL).
[9]. Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, pages 3982–3992, Hong Kong, China, 2019.




APPENDIX

 Implementation Details

 Hardware: Experiments were conducted on a machine with an Intel Core i7 CPU and 16 GB RAM.
 Software: Python 3.8, PyTorch 1.9.0, FAISS 1.7.0.
 Libraries: Transformers, SentenceTransformers, MoviePy, Librosa, NumPy.

 Parameters

 Video Segment Length: 30 seconds.
 Frames per Segment: 6.
 Embedding Dimensions:
 Text: 384.
 Image: 768.
 Combined: 1,152.

