An Efficient Transformer-Based System For Text-Based Video Segment Retrieval Using FAISS
Abstract:- An efficient system for text-based video segment retrieval is presented, leveraging transformer-based embeddings and the FAISS library for similarity search. The system enables users to perform real-time, scalable searches over video datasets by converting video segments into combined text and image embeddings. Key components include video segmentation, speech-to-text transcription using Wav2Vec 2.0, frame extraction, embedding generation using Vision Transformers and Sentence Transformers, and efficient similarity search using FAISS. Experimental results demonstrate the system's applicability in media archives, education, and content discovery, even when applied to a small dataset.

I. INTRODUCTION

[1]. These methods often lack the semantic understanding required for accurate retrieval.

Transformer Models in NLP and CV
Transformers have revolutionized NLP and computer vision tasks. Models like BERT [2] and Vision Transformers [3] have achieved state-of-the-art results in various applications. Wav2Vec 2.0 [4] has shown significant improvements in speech recognition tasks, making it suitable for automatic transcription.

Similarity Search with FAISS
FAISS [5] is a library for efficient similarity search and clustering of dense vectors. It has been used extensively for large-scale information retrieval tasks due to its speed and scalability.
Frame Extraction
Key frames are extracted from each video segment to capture visual information.

Frame Sampling: Six equally spaced frames per segment are extracted.

Timestamp Calculation: Using np.linspace, timestamps are generated excluding the start and end to avoid redundancy.

Frame Extraction: Frames are captured using MoviePy's get_frame method.

Saving Frames: Extracted frames are saved as JPEG images using Pillow [8], as sketched in the example below.
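The following minimal sketch illustrates this sampling procedure. The helper name extract_frames, the output paths, and the segment-boundary arguments are illustrative assumptions, not the paper's exact code.

```python
import os

import numpy as np
from moviepy.editor import VideoFileClip
from PIL import Image

def extract_frames(video_path, seg_start, seg_end, num_frames=6, out_dir="frames"):
    """Sample equally spaced key frames from one video segment (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    clip = VideoFileClip(video_path)
    # num_frames + 2 evenly spaced points, then drop the first and last
    # so the segment boundaries are excluded (avoids redundant frames).
    timestamps = np.linspace(seg_start, seg_end, num_frames + 2)[1:-1]
    paths = []
    for i, t in enumerate(timestamps):
        frame = clip.get_frame(t)  # H x W x 3 uint8 array from MoviePy
        path = os.path.join(out_dir, f"segment_{seg_start:.0f}_frame_{i}.jpg")
        Image.fromarray(frame).save(path, "JPEG")  # save via Pillow
        paths.append(path)
    clip.close()
    return paths
```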
Indexing: All embeddings are added to the FAISS index.
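A minimal indexing sketch follows, assuming the combined embedding is the concatenation of a text vector and an image vector; the 384- and 768-dimensional sizes are assumptions, not values from the paper. Inner product over L2-normalized vectors is used so that FAISS scores correspond to cosine similarity.

```python
import faiss
import numpy as np

# Assumed sizes: 384-d Sentence Transformer text vector + 768-d ViT image vector.
TEXT_DIM, IMAGE_DIM = 384, 768
COMBINED_DIM = TEXT_DIM + IMAGE_DIM

# Inner product over L2-normalized vectors equals cosine similarity.
index = faiss.IndexFlatIP(COMBINED_DIM)

# Placeholder segment embeddings; in the real pipeline these would be the
# concatenated text + image vectors, one row per video segment.
embeddings = np.random.rand(100, COMBINED_DIM).astype("float32")
faiss.normalize_L2(embeddings)  # L2-normalize in place
index.add(embeddings)           # add all segment embeddings to the index
```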
Similarity Search
The retrieval process involves querying the FAISS index with a user-provided text query.

Query Embedding: The query is embedded using the same Sentence Transformer model.

Normalization: The query embedding is L2-normalized and zero-padded to match the combined embedding size.

Similarity Search: FAISS retrieves the top-k most similar video segments based on cosine similarity, as in the sketch after this list.
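A query-side sketch under the same assumptions as the indexing example: the model name all-MiniLM-L6-v2 and the helper search_segments are illustrative, since the paper specifies only that the same Sentence Transformer model is reused.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Model name is an assumption; the paper says only "the same Sentence Transformer model".
model = SentenceTransformer("all-MiniLM-L6-v2")

def search_segments(query, index, combined_dim, k=5):
    """Embed a text query, normalize, zero-pad, and retrieve the top-k segments."""
    q = model.encode(query).astype("float32")  # text-only query embedding
    q /= np.linalg.norm(q)                     # L2-normalize
    # Zero-pad up to the combined (text + image) embedding size;
    # padding with zeros leaves the L2 norm unchanged.
    padded = np.zeros((1, combined_dim), dtype="float32")
    padded[0, :q.shape[0]] = q
    scores, ids = index.search(padded, k)      # cosine scores via inner product
    return list(zip(ids[0].tolist(), scores[0].tolist()))

# Usage (with the index built in the indexing sketch):
# results = search_segments("a lecture on neural networks", index, 384 + 768)
```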
Experiments and Results
VI. CONCLUSION
REFERENCES
APPENDIX
Implementation Details
Hardware: Experiments were conducted on a machine with an Intel Core i7 CPU and 16 GB RAM.
Software: Python 3.8, PyTorch 1.9.0, FAISS 1.7.0.
Libraries: Transformers, SentenceTransformers, MoviePy, Librosa, NumPy.
Parameters