PMRF Annual Review
Deep-Learning assisted Sports Video Analytics
By
Vipul Baghel
Department of Electrical Engineering
Indian Institute of Technology Gandhinagar
Gandhinagar, Gujarat, India
Thesis Advisor:
Dr. Ravi Sadananda Hegde
Roll No: 22350005 (Electrical Engineering)
PMRF ID: 1703280
Presentation Outline
❖ Journey
❖ Introduction
◆ Motion Understanding
◆ Background and Motivation
❖ Work Done
❖ Future Work
❖ References
Timeline
● Coursework (Sem 1 - Sem 9): Artificial Intelligence; Optimization in Machine Learning; Transformer and GNN; Writing; Nature Inspired Computing; Machine Learning; Digital Image Processing; Probabilistic Machine Learning; two HSS courses; Independent Project - Seminar
● Milestones: programme start (Aug 2022); Qualifiers-I (30 Oct - 4 Nov); Qualifiers-II (Jan); Proposal defense (May); Pre-synopsis seminar and Thesis submission
● Research thrusts: CV-based analytics using SOTA HPE with long-term tracking; fine-grained, highly dynamic motion analytics; design of a sports-specific S&S module; detection & classification of fine-grained sports-specific motion on wild videos
● Course credits completed: 43
● Thesis credits completed: 28
Deep-Learning assisted Motion Understanding
● Definition:
Motion understanding in computer vision refers to the process of analyzing, interpreting, and modeling dynamic changes in visual scenes over time, with a focus on extracting meaningful patterns of movement from sequences of images or videos. It aims to identify, localize, classify, and understand temporal events and actions performed by objects—particularly humans—in a scene.
This process involves multiple sub-tasks such as:
● Motion Estimation: e.g., optical flow, trajectory prediction.
● Action Recognition: identifying specific human actions.
● Temporal Segmentation: detecting the start and end points of motion events.
● Pose Estimation & Tracking: localizing body joints over time.
● Motion Pattern Modeling: learning spatio-temporal representations of dynamic behavior.
Temporal Action Localization (TAL)
Work Done
Objective 1
● Overview[1]:
● A novel, well-annotated dataset is collected from 20 YouTube boxing sparring/practice videos featuring 18 athletes (11 male, 7 female), with a total duration of approximately 4 hours.
○ The dataset includes key body joints, punch start/end times, and labels for six fine-grained punch categories (Cross, Jab, Lead Hook, Lead Uppercut, Rear Hook, and Rear Uppercut), capturing diverse techniques.
● Overall, there are 6915 detected, demarcated, and labeled boxing action subclips in the dataset, with an average duration of 1 second.
Work Done
Objective 1
● Continued:
● A hierarchical framework is proposed for punch detection, demarcation, extraction, and classification from raw combat-sports videos. The two-step approach first detects boundaries and extracts fixed-length subclips, then classifies the punches into six categories.
● The task of combat-activity detection is framed as a regression problem rather than a classification problem: each rolling window is scored to help identify subclip frame boundaries.
● The utility of the proposed methodology is demonstrated as a home-training tool for boxers using cost-effective, consumer-grade video capture.
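The rolling-window regression framing above can be sketched as follows. This is a minimal illustration, not the thesis pipeline: the learned window regressor is replaced by a simple motion-energy proxy over per-frame pose features, and the window size, stride, and threshold are all illustrative.

```python
import numpy as np

def score_windows(features, window=16, stride=4):
    """Score each rolling window over per-frame pose features.
    The learned regressor from the paper is replaced here by a
    simple motion-energy proxy: the mean frame-to-frame feature
    change inside the window."""
    scores = []
    for start in range(0, len(features) - window + 1, stride):
        chunk = features[start:start + window]
        diffs = np.linalg.norm(np.diff(chunk, axis=0), axis=1)
        scores.append((start, float(diffs.mean())))
    return scores

def pick_boundaries(scores, threshold):
    """Window starts whose score crosses the threshold become
    candidate subclip boundaries for the downstream classifier."""
    return [start for start, s in scores if s >= threshold]

# Toy sequence: 100 frames of 34-D pose features, quiet except for a
# burst of motion (a simulated punch) between frames 40 and 60.
rng = np.random.default_rng(0)
feats = np.zeros((100, 34))
feats[40:60] = np.cumsum(rng.normal(0.0, 1.0, (20, 34)), axis=0)
scores = score_windows(feats)
boundaries = pick_boundaries(scores, threshold=1.0)
```

On this toy input, only windows overlapping the motion burst score above the threshold, so the surviving window starts bracket the simulated punch.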
Work Done
Objective 1
● Results:
Classification performance of the proposed pipeline on the testing dataset
Use case: punch classes across multiple subjects
Work Done
Objective 2
● Overview[2]:
● Unsupervised Skeleton-Based Learning Framework:
We introduce an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) encoder, pretrained via blockwise pose-sequence learning. This approach captures blockwise motion dynamics, and scene transitions are identified from changes in the curvature of the motion-dynamics sequence, introduced as the Action Dynamics Sequence (ADM).
● Empirical Validation:
Our model achieves performance comparable to supervised methods on the DSV Diving dataset. Additionally, we demonstrate the generalization capability of our approach on out-of-distribution, in-the-wild diving videos.
● Theoretical and Visual Interpretability:
We provide a graphical representation of the learned embeddings as a measure of pose dynamics and transitions. Furthermore, we provide an analytical proof demonstrating that inflection points correspond to action-transition states.
Work Done
Objective 2
● Results:
Comparison of our model with DiveNet in detecting pose transitions.
Performance comparison with a varying number of ASTGCN blocks
Performance comparison with varying Chebyshev filter sizes
Work Done
Objective 2
● ADM Interpretation:
Work Done
Objective 3
● Overview[3]:
● A completely unsupervised, graph-spectrum-based temporal action localization method. It includes pre-training of the ASTGCN model with reconstruction as a pretext task; spectral clustering is then performed to obtain the micro-level segmentations.
● Our approach is evaluated on the validation split of the largest available 3D pose-sequence dataset with frame-level annotations, i.e., the BABEL dataset. We achieved a mean average precision (mAP) almost 25 points higher (with zero fine-tuning) than other unsupervised works that use fine-tuning.
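The spectral-clustering step can be sketched in plain numpy. This is a bare-bones illustration only: the actual pipeline clusters ASTGCN-encoded embeddings, whereas here the RBF affinity, the tiny k-means loop, and the toy two-regime embeddings are all assumptions.

```python
import numpy as np

def spectral_segments(emb, k, sigma=1.0):
    """Cluster per-frame embeddings with bare-bones spectral
    clustering: RBF affinity -> normalized Laplacian -> k smallest
    eigenvectors -> tiny k-means with deterministic farthest-point
    initialization. Frames sharing a cluster form one micro-segment."""
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))           # RBF affinity
    deg = W.sum(1)
    L = np.eye(len(W)) - W / np.sqrt(deg[:, None] * deg[None, :])
    _, vecs = np.linalg.eigh(L)                    # ascending eigenvalues
    U = vecs[:, :k]
    U = U / np.linalg.norm(U, axis=1, keepdims=True)
    idx = [0]                                      # farthest-point init
    for _ in range(1, k):
        dmin = ((U[:, None] - U[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(np.argmax(dmin)))
    centers = U[idx].copy()
    for _ in range(20):
        labels = ((U[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(0)
    return labels

# Toy embeddings: two well-separated motion regimes of 30 frames each.
rng = np.random.default_rng(1)
emb = np.concatenate([rng.normal(0.0, 0.1, (30, 8)),
                      rng.normal(3.0, 0.1, (30, 8))])
labels = spectral_segments(emb, k=2)
```

The two regimes fall cleanly into two clusters, i.e., two micro-level segments with a transition at the regime boundary.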
Work Done
Objective 3
● Continued:
● The optimal number of distinct actions that could be present in a pose sequence is determined using an ensemble estimator built from three cluster-quality score methods.
● We perform ablation studies over various backbone and clustering methods. Furthermore, we provide an analytical proof that the spectral clusters derived from the low-dimensional, reduced ASTGCN-encoded embeddings are segments belonging to the same pose dynamics; hence, the method yields pose transitions on the basis of clusters.
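The slide does not name the three cluster-quality scores, so the sketch below uses silhouette, Calinski-Harabasz, and Davies-Bouldin as plausible stand-ins, each voting for its preferred cluster count; the majority vote is the ensemble estimate. All of it is illustrative, down to the toy three-cluster data.

```python
import numpy as np

def kmeans(X, k, iters=30):
    """Plain k-means with deterministic farthest-point initialization."""
    idx = [0]
    for _ in range(1, k):
        dmin = ((X[:, None] - X[idx][None]) ** 2).sum(-1).min(1)
        idx.append(int(np.argmax(dmin)))
    centers = X[idx].astype(float)
    for _ in range(iters):
        labels = ((X[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(0)
    return labels, centers

def calinski_harabasz(X, labels, centers):
    n, k = len(X), len(centers)
    mean = X.mean(0)
    between = sum((labels == j).sum() * ((centers[j] - mean) ** 2).sum()
                  for j in range(k))
    within = sum(((X[labels == j] - centers[j]) ** 2).sum() for j in range(k))
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(X, labels, centers):
    k = len(centers)
    s = [np.linalg.norm(X[labels == j] - centers[j], axis=1).mean()
         for j in range(k)]
    worst = [max((s[i] + s[j]) / np.linalg.norm(centers[i] - centers[j])
                 for j in range(k) if j != i) for i in range(k)]
    return float(np.mean(worst))

def silhouette(X, labels):
    D = np.linalg.norm(X[:, None] - X[None], axis=-1)
    ks = np.unique(labels)
    vals = []
    for i in range(len(X)):
        own = labels == labels[i]
        a = D[i, own].sum() / max(own.sum() - 1, 1)
        b = min(D[i, labels == c].mean() for c in ks if c != labels[i])
        vals.append((b - a) / max(a, b))
    return float(np.mean(vals))

def estimate_k(X, k_range=range(2, 7)):
    """Each score votes for its preferred k; the ensemble answer is the
    majority vote (ties resolved arbitrarily in this sketch)."""
    ch, db, sil = {}, {}, {}
    for k in k_range:
        labels, centers = kmeans(X, k)
        ch[k] = calinski_harabasz(X, labels, centers)
        db[k] = davies_bouldin(X, labels, centers)
        sil[k] = silhouette(X, labels)
    votes = [max(ch, key=ch.get), min(db, key=db.get), max(sil, key=sil.get)]
    return max(set(votes), key=votes.count), votes

# Toy data: three well-separated action clusters of 40 points each.
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(c, 0.2, (40, 2))
                    for c in ((0, 0), (5, 0), (0, 5))])
k_hat, votes = estimate_k(X)
```

On well-separated data all three scores agree, so the vote is unanimous; the ensemble matters precisely when the individual scores disagree on noisier embeddings.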
Work Done
Objective 3
● Results:
Detection mAP@IoU (%) across thresholds from 0.1 to 0.5 on the BABEL dataset
Next Work
Objective
● Tasks:
● 2D-domain extension of the TAL problem
● Fine-grained skeleton-based pseudo-labelling
● Dataset Creation:
○ Fine-grained action detection
■ Laboratory data collection using 3D camera
■ Apply bootstrapping to label YouTube videos
● Physics Prior:
● Human Oriented Transformation
● View-Invariant Transformation
● Rotational-Invariant Transformation
● Relative Coordinate Conversion
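The listed physics-prior transformations could be prototyped along these lines for 2-D skeletons. The joint indices and the hip-alignment convention are assumptions for illustration, not the thesis's actual design.

```python
import numpy as np

def to_relative(joints, root=0):
    """Relative-coordinate conversion: express every joint relative to a
    root joint so the sequence becomes translation-invariant.
    `joints` has shape (frames, joints, 2)."""
    return joints - joints[:, root:root + 1, :]

def align_hips(joints, left_hip=1, right_hip=2):
    """Rotate each frame so the left-hip -> right-hip line lies on the
    x-axis: a simple stand-in for a view/rotation-invariant transform."""
    out = np.empty_like(joints)
    for t, frame in enumerate(joints):
        v = frame[right_hip] - frame[left_hip]
        theta = np.arctan2(v[1], v[0])
        c, s = np.cos(-theta), np.sin(-theta)
        R = np.array([[c, -s], [s, c]])    # rotation by -theta
        out[t] = frame @ R.T
    return out

# One toy frame: root, left hip, right hip, head (indices are assumptions).
frame = np.array([[0.0, 0.0], [-0.5, 0.1], [0.5, -0.1], [0.0, 1.0]])
seq = np.stack([frame])
aligned = align_hips(to_relative(seq))
```

After the two transforms the root sits at the origin and the hip line is horizontal, so the same pose captured from a shifted or rotated camera maps to the same coordinates.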
References
[1] Rahul Kumar†, Vipul Baghel†, Sudhanshu Singh, Shivam Yadav, Babji Srinivasan, and Ravi Hegde, "Real-Time Combat Training Analytics: Skeleton-based Temporal Action Localization in Unstructured Video," 2025 10th National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIPG). (Under Review)
[2] Bikash Kumar Badatya†, Vipul Baghel†, and Ravi Hegde, "Precise Motion Transitions Detection in Untrimmed Sports Videos Using Spatio-Temporal Graph Embeddings," 2025 IEEE International Conference on Image Processing (ICIP). (Under Review)
[3] Vipul Baghel and Ravi Hegde, "Label-Free Temporal Action Localization based on Spectral Analysis of Graph-Encoded Motion Embeddings," IEEE Transactions on Pattern Analysis and Machine Intelligence. (Submitted)
[4] P. Wang, F. Zeng, and Y. Qian, "A survey on deep learning-based spatio-temporal action detection," International Journal of Wavelets, Multiresolution and Information Processing, 2024.
[5] E. Vahdani and Y. Tian, "Deep learning-based action detection in untrimmed videos: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[6] B. Wang, Y. Zhao, L. Yang, T. Long, and X. Li, "Temporal action localization in the deep learning era: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
[7] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, "Attention based spatial-temporal graph convolutional networks for traffic flow forecasting," Proceedings of the AAAI Conference on Artificial Intelligence, 2019.