Deep Learning for Sports Action Detection

The document presents a thesis on Deep-Learning assisted Sports Video Analytics, focusing on motion understanding and action recognition in sports videos. It details the work done, including the creation of a novel dataset for boxing actions, the development of an unsupervised learning framework for pose dynamics, and the implementation of a graph-spectrum based approach for temporal action localization. Future work includes extending the model to 2D domains and creating fine-grained action detection datasets.


PMRF Annual Review

Deep-Learning assisted Sports Video Analytics


By
Vipul Baghel
Department of Electrical Engineering
Indian Institute of Technology Gandhinagar
Gandhinagar, Gujarat, India

Thesis Advisor:
Dr. Ravi Sadananda Hegde
Roll no: 22350005
(Electrical Engineering) PMRF ID: 1703280
Presentation Outline

❖ Journey

❖ Introduction

◆ Motion Understanding

◆ Background and Motivation

❖ Work Done

❖ Future Work

❖ References

Timeline

Sem 1 to Sem 9, starting Aug 2022

Coursework:
● Artificial Intelligence
● Optimization in Machine Learning
● Transformer and GNN
● Writing
● Nature Inspired Computing
● Machine Learning
● Digital Image Processing
● Probabilistic Machine Learning
● HSS Course
● Independent Project - Seminar
● HSS Course

Milestones:
● Qualifiers-I (30 Oct - 4 Nov)
● Qualifiers-II (Jan)
● Proposal defense (May)
● Pre-synopsis seminar and Thesis submission

Research progression:
● CV based analytics using sota HPE module
● Design sports-specific long-term tracking
● Fine-grained highly dynamic motion analytics S&S
● Detection & Classification of fine-grained sports-specific motion on wild videos

Course credits completed: 43
Thesis credits completed: 28
Deep-Learning assisted Motion Understanding
● Definition:
Motion understanding in computer vision refers to the process of analyzing, interpreting, and modeling dynamic changes in visual scenes over time, with a focus on extracting meaningful patterns of movement from sequences of images or videos.

It aims to identify, localize, classify, and understand temporal events and actions performed by objects, particularly humans, in a scene.

This process involves multiple sub-tasks such as:
● Motion Estimation: e.g., optical flow, trajectory prediction.
● Action Recognition: identifying specific human actions.
● Temporal Segmentation: detecting the start and end points of motion events.
● Pose Estimation & Tracking: localizing body joints over time.
● Motion Pattern Modeling: learning spatio-temporal representations of dynamic behavior.

Temporal Action Localization (TAL)
Work Done
Objective 1
● Overview[1]:
● A novel, well-annotated dataset is collected from 20 YouTube boxing sparring/practice videos featuring 18 athletes (11 male, 7 female), with a total duration of approximately 4 hours.

○ The dataset includes key body joints, punch start/end times, and labels for six fine-grained punch categories (Cross, Jab, Lead Hook, Lead Uppercut, Rear Hook, and Rear Uppercut), capturing diverse techniques.

● Overall, there are 6915 detected, demarcated, and labeled boxing action subclips in the dataset, with an average duration of 1 second.
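
One way to picture a single annotated subclip is the record below. This is only an illustrative sketch: the class name `PunchAnnotation`, its field names, and the 2D joint layout are assumptions and do not describe the dataset's actual file format. Only the six punch labels are taken from the slide.

```python
from dataclasses import dataclass
from typing import List, Tuple

# The six fine-grained punch categories named on the slide.
PUNCH_CLASSES = (
    "Cross", "Jab", "Lead Hook", "Lead Uppercut", "Rear Hook", "Rear Uppercut",
)


@dataclass
class PunchAnnotation:
    """Hypothetical record for one labeled punch subclip."""
    video_id: str      # assumed identifier of the source video
    start_s: float     # punch start time in seconds
    end_s: float       # punch end time in seconds
    label: str         # one of PUNCH_CLASSES
    joints: List[List[Tuple[float, float]]]  # assumed: per frame, per joint (x, y)

    def __post_init__(self) -> None:
        # Reject labels outside the six categories and empty time spans.
        if self.label not in PUNCH_CLASSES:
            raise ValueError(f"unknown punch class: {self.label}")
        if self.end_s <= self.start_s:
            raise ValueError("end_s must be greater than start_s")

    @property
    def duration_s(self) -> float:
        return self.end_s - self.start_s
```

A record such as `PunchAnnotation("clip_001", 12.4, 13.4, "Jab", [])` would then describe a jab lasting about one second, in line with the average subclip duration quoted above.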

Work Done
Objective 1
● Continued:
● A hierarchical framework is proposed for punch detection, demarcation, extraction, and classification from raw combat sports videos. The two-step approach first detects boundaries and extracts fixed-length subclips, then classifies punches into six categories.

● The task of combat activity detection is framed as a regression problem rather than a classification problem: each rolling window is scored, and the scores help identify subclip frame boundaries.

● The utility of the proposed methodology is demonstrated as a home-training tool for boxers using cost-effective consumer-grade video capture.
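
The rolling-window scoring idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline from [1]: the function name `extract_subclips`, the local-peak rule, and the `threshold`, `clip_len`, and `stride` parameters are all hypothetical, and the per-window scores are presumed to come from an already-trained regressor.

```python
from typing import List, Tuple


def extract_subclips(
    scores: List[float],
    threshold: float,
    clip_len: int,
    stride: int = 1,
) -> List[Tuple[int, int]]:
    """Pick fixed-length subclips at local peaks of per-window scores.

    Assumption: window i starts at frame i * stride and spans clip_len frames.
    Returns inclusive (start_frame, end_frame) pairs that do not overlap.
    """
    clips: List[Tuple[int, int]] = []
    last_end = -1
    for i, s in enumerate(scores):
        left = scores[i - 1] if i > 0 else float("-inf")
        right = scores[i + 1] if i + 1 < len(scores) else float("-inf")
        # A window is kept if it clears the threshold and is a local peak.
        is_peak = s >= threshold and s > left and s >= right
        if not is_peak:
            continue
        start = i * stride
        if start <= last_end:
            continue  # skip windows overlapping the previously kept subclip
        end = start + clip_len - 1
        clips.append((start, end))
        last_end = end
    return clips


window_scores = [0.1, 0.3, 0.9, 0.4, 0.2, 0.1, 0.8, 0.85, 0.3]
subclips = extract_subclips(window_scores, threshold=0.5, clip_len=3)
```

Here `subclips` holds two frame ranges, one per score peak; each range could then be passed to the second-stage classifier.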

Work Done
Objective 1
● Results:

Classification performance of the proposed pipeline on the testing dataset

Use case of punch classes of multiple subjects


Work Done
Objective 2
● Overview[2]:
● Unsupervised Skeleton-Based Learning Framework:
We introduce an Attention-based Spatio-Temporal Graph Convolutional Network (ASTGCN) encoder, pretrained via blockwise pose-sequence learning. This approach captures blockwise motion dynamics, and scene transitions are identified from changes in the curvature of the motion-dynamics sequence, introduced as the Action Dynamics Sequence (ADM).

● Empirical Validation:
Our model achieves performance comparable to supervised methods on the DSV Diving dataset. Additionally, we demonstrate the generalization capability of our approach on out-of-distribution, in-the-wild diving videos.

● Theoretical and Visual Interpretability:
We provide a graphical representation of the learned embeddings as a measure of pose dynamics and transitions. Furthermore, we provide an analytical proof demonstrating that inflection points correspond to action transition states.
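
The link between curvature changes and transitions can be illustrated on a one-dimensional sequence. The sketch below is an assumption-laden toy, not the method from [2]: it presumes the motion dynamics have already been reduced to one scalar per block, approximates curvature by the discrete second difference, and uses a hypothetical function name `inflection_points`.

```python
from typing import List


def inflection_points(seq: List[float], eps: float = 1e-9) -> List[int]:
    """Return indices where the sign of the discrete second difference flips.

    The second difference at index i is seq[i-1] - 2*seq[i] + seq[i+1].
    Each reported index is the first one at which the new sign appears.
    """
    second = [
        seq[i - 1] - 2.0 * seq[i] + seq[i + 1]
        for i in range(1, len(seq) - 1)
    ]
    points: List[int] = []
    prev_sign = 0
    for k, d in enumerate(second):
        sign = 1 if d > eps else -1 if d < -eps else 0
        if sign == 0:
            continue  # flat curvature carries no sign information
        if prev_sign != 0 and sign != prev_sign:
            points.append(k + 1)  # shift back to an index into seq
        prev_sign = sign
    return points


# Toy sequence: accelerating up to index 2, decelerating from index 3 on.
toy_sequence = [0.0, 1.0, 4.0, 9.0, 12.0, 13.0, 12.0]
transitions = inflection_points(toy_sequence)
```

On real embeddings the sequence would usually need smoothing first, since a second difference amplifies noise.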

Work Done
Objective 2
● Results:

Comparison of our model with DiveNet in detecting pose transitions.

Performance comparison with varying number of ASTGCN Blocks

Performance comparison with varying Chebyshev filter size


Work Done
Objective 2
● ADM Interpretation:

Work Done
Objective 3
● Overview[3]:
● A completely unsupervised, graph-spectrum-based approach to temporal action localization. The ASTGCN model is pre-trained with reconstruction as the pretext task, and spectral clustering is then performed to obtain micro-level segmentations.

● Our approach is evaluated on the validation set of the largest available 3D pose-sequence dataset with frame-level annotations, the BABEL dataset.
With zero fine-tuning, we achieve a mean average precision (mAP) higher by almost 25 than other unsupervised works that use fine-tuning.

Work Done
Objective 3
● Continued:

● The optimal number of distinct actions that could be present in a pose sequence is determined using an ensemble estimator built from three cluster-quality scores.

● We perform an ablation study over various backbones and clustering methods. Furthermore, we provide an analytical proof that the spectral clusters derived from the low-dimensional, reduced ASTGCN-encoded embeddings are segments that belong to the same pose dynamics. Hence, the method identifies pose transitions on the basis of clusters.
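
Reading pose transitions off cluster assignments can be sketched as below. This is a hedged illustration that assumes each frame already carries a cluster label from some clustering step. The function name `labels_to_segments` and the output layout are hypothetical and are not taken from [3].

```python
from typing import List, Tuple


def labels_to_segments(labels: List[int]) -> List[Tuple[int, int, int]]:
    """Group consecutive frames with the same cluster label.

    Returns (start_frame, end_frame, label) triples with inclusive bounds.
    """
    segments: List[Tuple[int, int, int]] = []
    if not labels:
        return segments
    start = 0
    for t in range(1, len(labels) + 1):
        # Close the current run at the end of the sequence or on a label change.
        if t == len(labels) or labels[t] != labels[start]:
            segments.append((start, t - 1, labels[start]))
            start = t
    return segments


frame_labels = [0, 0, 0, 1, 1, 0, 2, 2]
segments = labels_to_segments(frame_labels)
# A transition is the first frame of every segment after the first one.
transition_frames = [seg[0] for seg in segments[1:]]
```

Very short runs, such as the single frame at index 5 in this toy input, would typically be merged or filtered before being reported as actions.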

Work Done
Objective 3

● Results:

Detection mAP@IoU(%) across different thresholds from 0.1 to 0.5 on BABEL dataset
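
For readers unfamiliar with the metric, the temporal IoU that underlies mAP@IoU can be computed as in the sketch below. The segment values are made up for illustration, and the function name `temporal_iou` is an assumption. The full mAP computation (ranking predictions and averaging precision over classes) is not shown.

```python
from typing import Tuple

Segment = Tuple[float, float]  # (start, end) on a shared time axis


def temporal_iou(pred: Segment, gt: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


# Hypothetical predicted and ground-truth segments.
iou = temporal_iou((2.0, 6.0), (4.0, 8.0))

# A prediction counts as a hit at a threshold if its IoU reaches it.
thresholds = [0.1, 0.2, 0.3, 0.4, 0.5]
hits = [iou >= t for t in thresholds]
```

With these values the overlap is 2 out of a union of 6, so the prediction counts as correct at the three lowest thresholds only.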
Next Work
Objective

● Tasks:
● 2D domain extension of TAL problem
● Fine-grained skeleton-based pseudo labelling
● Dataset Creation:
○ Fine-grained action detection
■ Laboratory data collection using 3D camera
■ Apply bootstrapping to label YouTube videos

● Physics Prior:
● Human Oriented Transformation
● View-Invariant Transformation
● Rotational-Invariant Transformation
● Relative Coordinate Conversion
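
As an example of the last item, relative coordinate conversion is often done by expressing every joint relative to a root joint. The sketch below assumes 3D joints and a root at index 0. Both choices, and the function name `to_root_relative`, are illustrative assumptions rather than the planned design.

```python
from typing import List, Tuple

Joint = Tuple[float, float, float]


def to_root_relative(joints: List[Joint], root_index: int = 0) -> List[Joint]:
    """Subtract the root joint from every joint of one skeleton frame."""
    rx, ry, rz = joints[root_index]
    return [(x - rx, y - ry, z - rz) for (x, y, z) in joints]


# Toy two-joint skeleton; after conversion the root sits at the origin.
frame = [(1.0, 2.0, 3.0), (2.0, 4.0, 3.0)]
relative = to_root_relative(frame)
```

This removes the subject's global position from the representation, which is one simple step toward the invariances listed above.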

References
[1] Rahul Kumar†, Vipul Baghel†, Sudhanshu Singh, Shivam Yadav, Babji Srinivasan, and Ravi Hegde, “Real-Time Combat Training Analytics: Skeleton-based Temporal Action Localization in Unstructured Video,” 2025 10th National Conference on Computer Vision, Pattern Recognition, Image Processing and Graphics (NCVPRIG). (Under Review)

[2] Bikash Kumar Badatya†, Vipul Baghel†, and Ravi Hegde, “Precise Motion Transitions Detection in Untrimmed Sports Videos Using Spatio-Temporal Graph Embeddings,” 2025 IEEE International Conference on Image Processing (ICIP). (Under Review)

[3] Vipul Baghel and Ravi Hegde, “Label-Free Temporal Action Localization based on Spectral Analysis of Graph-Encoded Motion Embeddings,” IEEE Transactions on Pattern Analysis and Machine Intelligence. (Submitted)

[4] P. Wang, F. Zeng, and Y. Qian, “A survey on deep learning-based spatio-temporal action detection,” International Journal of Wavelets, Multiresolution and Information Processing, 2024.

[5] E. Vahdani and Y. Tian, “Deep learning-based action detection in untrimmed videos: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

[6] B. Wang, Y. Zhao, L. Yang, T. Long, and X. Li, “Temporal action localization in the deep learning era: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

[7] S. Guo, Y. Lin, N. Feng, C. Song, and H. Wan, “Attention based spatial-temporal graph convolutional networks for traffic flow forecasting,” Proceedings of the AAAI Conference on Artificial Intelligence, 2019.
