Abstract:
Human-centric deep video understanding seeks to close the gap between computer vision and human perception by creating AI systems that can interpret and grasp video content in a way similar to that of a human. Deep learning-based methods currently in use concentrate on object detection, action recognition, and scene interpretation, and overlook the complexities of social relationships, human behavior, and emotional intelligence. This restriction prevents them from fully comprehending audiovisual content. Our method has applications in video analytics, social robotics, and human-computer interaction, among other areas. By creating systems that comprehend video information from a human perspective, we can enhance AI systems' capacity to communicate with people, identify social cues, and offer more precise insights into human behavior. This paper proposes an approach to deep video understanding that focuses on the subtleties of social interactions and human behavior, with the aim of developing AI systems that behave more like humans.
Introduction:
Artificial intelligence (AI) is receiving enormous attention, and for good reason. Machines now match or outperform humans in areas such as face and object recognition, IQ tests, games, speech recognition, and written text comprehension and translation. These advances rest on three basic pillars. The first is access to 'big data', such as thousands of hours of transcribed speech and tens of millions of labeled images. The second is the availability of large computational capability, such as GPUs and clusters of cloud servers. The third is algorithmic progress in machine learning, most notably deep learning and reinforcement learning. This is why it is often said that we live in an era of AI.
Deep CNN solutions, originally developed for still-image classification, have been adopted for video classification since 2012. More recently, a new paradigm has emerged: the Vision Transformer, or ViT for short. Like CNNs, ViTs capture long-range context, but they do so through self-attention over sequences of image patches rather than over whole images. This approach has been applied to a range of image classification tasks with promising results, often matching or surpassing standard CNNs.
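To make the patch-based idea concrete, the following is a minimal PyTorch sketch of ViT-style patch embedding followed by one self-attention layer; the image size, patch size, and embedding dimension are illustrative choices, not those of any particular published model.

```python
# Minimal sketch of ViT-style patch embedding (illustrative dimensions, PyTorch).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and projects each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, 768): one token per patch

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
encoder = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
tokens = encoder(patches)                       # self-attention over the patch sequence
print(tokens.shape)                             # torch.Size([1, 196, 768])
```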
Because ViTs scale well with more computing resources and, especially, with big data, this model family has become important across computer vision. ViTs are now being extended to other purposes such as object detection and semantic segmentation, while the field as a whole gradually moves toward deploying higher-level computer vision technologies in real-world applications. It is therefore fair to say that the Vision Transformer is one of the leading developments in artificial intelligence, as it provides a firm basis for progress in the vast area of visual information understanding.
Related Work:
Research findings suggest that some 80% to 85% of human perception, learning, cognition, and activity is mediated through vision. This means that developing visual intelligence is also essential for AI. Advances in the application of deep learning have greatly enhanced the assessment and identification of images.
These approaches build on convolutional neural networks (CNNs), which emerged as the dominant approach in 2012 and have since greatly improved performance. In recent years, however, a new model, the Vision Transformer (ViT), has driven a change of direction: images are split into patches and self-attention is used to capture dependencies that extend beyond individual patches. This change of perspective has proved effective across a variety of image classification tasks, where such models often outperform deep CNNs.
The Vision Transformer also scales well with larger datasets and more computing power and has become one of the significant trends in computer vision. Applying ViTs has improved object detection and semantic segmentation, which further increases their reliability. Figure 1 compares the error rates of ViT-based systems with those of previous CNN-based winners of the ImageNet classification challenge.
In 2020, Dosovitskiy et al. introduced the Vision Transformer, which surpasses traditional CNN-based systems on the ImageNet classification task. This suggests that ViTs can interpret images more accurately than human annotators, much as deep residual networks did in 2015. The shift from traditional CNNs to Vision Transformers is another demonstration of the rapid pace at which deep learning methodologies advance.
ViTs have also proved useful for object detection and semantic segmentation, where they have returned good results. Such developments place the field on a trajectory toward state-of-the-art technology built on the applicability of computer vision to various real-world applications, giving rise to a new generation of AI-driven visual comprehension.
Methodology:
Historically, a great leap in techniques for still images is usually followed, within a short time, by similar improvements for moving pictures. Video analytics to support intelligent applications is well accepted in both the enterprise and consumer sectors. Beyond the obvious area of public security, new and upcoming uses include business intelligence, home security, self-driving cars, and narration.

Videos also present considerably more difficulties than still images. Detecting pedestrians in videos, for example, is more challenging than in still images, because videos encompass almost all the content that could conceivably occur. Storage, computation, and communication requirements are much greater, and at times the information must be processed as it arrives. Labeling video data is particularly expensive, and in some situations sufficient training data is simply not available. For instance, when analyzing surveillance video, access to the data is frequently problematic and the number of positive samples may be low. Such challenges push practical video analytics technologies to their limits.
Fig. 1. Performance of the winners of the ImageNet classification competitions over the years.
Results:
Understanding a video depends on identifying the people in the scene, their characteristics, and what they are doing. Much has been achieved in many fundamental vision tasks focused on human-centric applications. An outline of some of these technologies is given below.
A) People Tracking
Visual object tracking is one of the universally important problems in video analysis and understanding. Given the bounding box of a target object in the first frame of a video, the job of a tracker is to locate the target in every subsequent frame. Single object tracking can generally be regarded as a detection problem, because in essence it involves re-detecting a particular object from one frame to the next. The major challenge is that two partly contradicting demands must be met at the same time: on one hand, robustness, the ability to generalize to unfamiliar appearances; on the other hand, discrimination, the ability to distinguish the target from objects already seen. Robustness means that a tracker should not lose the target under variations in illumination, target motion, view angle, or object deformation. Discrimination means that the tracker has to separate the target object from similar objects and from clutter in the background. Both are conventionally managed through online training, which strives to attain this flexibility.
Since 2015, researchers have employed CNNs in tracking. Although deep features slow down online training, they are powerful enough that online training can be eliminated altogether. The first work in this direction is SiamFC, where Siamese CNNs are used to extract features of the target and of the search region, followed by a conventional cross-correlation layer that provides an efficient sliding-window matching mechanism. As a result, SiamFC achieves a working frame rate of 86 fps on a GPU. Many follow-up works, such as SA-Siam and SiamRPN, build on this basic concept.
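The following is a minimal sketch, in PyTorch, of the cross-correlation matching used by SiamFC-style trackers; the toy backbone and crop sizes are placeholders rather than the actual SiamFC architecture.

```python
# Sketch of SiamFC-style cross-correlation between a template and a search region (PyTorch).
# The backbone here is a toy stand-in; SiamFC itself uses an AlexNet-like CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
)

template = torch.randn(1, 3, 127, 127)   # exemplar crop of the target (first frame)
search   = torch.randn(1, 3, 255, 255)   # larger search region in the current frame

z = backbone(template)                   # template features
x = backbone(search)                     # search-region features

# Using the template features as a correlation kernel slides the target over the
# search region in a single convolution, producing a response (score) map.
response = F.conv2d(x, z)
cy, cx = divmod(response.flatten().argmax().item(), response.shape[-1])
print(response.shape, (cy, cx))          # peak location ~ new target position
```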
Fig. 2. Accuracy-speed trade-off of top-performing trackers on the OTB-100 benchmark. The speed
axis is logarithmic. Reproduced from Fig. 8 of [7]. Please refer to [7] for the notations of different
trackers. “Ours” refers to the SPM-Tracker [7].
Most object tracking work that builds on SiamFC uses a single-stage architecture, but a two-stage SiamFC-based network has been introduced to address robustness and discrimination jointly. In this design, the first, coarse matching stage aims to improve robustness, and the second, fine matching stage increases discrimination power by replacing the cross-correlation layer with a more elaborate distance-learning subnetwork. The results are statistically sound, and the tracker operates at 120 frames per second on an NVIDIA P100 GPU. Some of these comparisons are highlighted in Figure 2, which shows the accuracy-speed trade-off of the best trackers on the OTB-100 benchmark.
In many real-life situations there is a need to track several individuals at once. Solving multiple-person tracking as a collection of independent single-person tracking problems is generally inefficient. Modern multi-person tracking approaches rely on the tracking-by-detection paradigm: a general object/person detector detects objects of the target class (person) in individual frames, and the detections are then connected across frames. Techniques include importance sampling and particle filtering for propagating the state in a Bayesian way, linking short tracks over long time spans with the Hungarian algorithm for optimal assignment, and greedy dynamic programming in which trajectories are instantiated one at a time. To reduce wrong identity assignments, recent studies have posed the task of connecting detections over longer periods using global optimization techniques, as illustrated below. A typical method in multi-person tracking is constrained flow optimization using the k-shortest paths method. Other related approaches include graph-based formulations of the minimum cost multicut problem.
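As a concrete illustration of the assignment step, the sketch below links detections between two frames with the Hungarian algorithm via SciPy; the IoU-based cost and the gating threshold are common but by no means the only choices.

```python
# Minimal sketch: linking detections between consecutive frames with the
# Hungarian algorithm (scipy). Boxes are (x1, y1, x2, y2); cost = 1 - IoU.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

tracks = np.array([[10, 10, 50, 90], [100, 40, 150, 120]])      # boxes from frame t
detections = np.array([[12, 14, 52, 94], [98, 38, 149, 118]])   # boxes from frame t+1

cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
row, col = linear_sum_assignment(cost)          # optimal one-to-one assignment
for r, c in zip(row, col):
    if cost[r, c] < 0.7:                        # gating threshold on the match cost
        print(f"track {r} -> detection {c} (IoU={1 - cost[r, c]:.2f})")
```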
Compared with tracking-by-detection approaches that separate tracking from detection, joint detection and tracking may be more efficient, and more research should be dedicated to joint approaches for multiple-object/people tracking. Better use also needs to be made of spatial and temporal structure, and a balance between complexity and accuracy has to be found. Long-term tracking is even more complex: it often produces intermediate short-term tracklets, and linking or matching these tracklets over time then becomes necessary, for instance by using object re-identification (re-ID) methods.
People tracking has been reshaped by the Vision Transformer (ViT), since its self-attention mechanism can capture both spatial and temporal context. ViTs take sequences of image patches as input, which makes them well suited to analyzing videos, where temporal dependencies are crucial. They scale readily to large amounts of data and compute and provide strong solutions for complex tracking problems. In addition, ViTs can be fine-tuned with smaller amounts of supervised data, which makes them less sensitive to the problems of expensive labeling and limited data samples.
For practical people-tracking systems, ViTs can be designed to cope with occlusion and body deformation. The ViT approach can also be extended to more sophisticated re-ID architectures, especially given the possibility of matching hypotheses over larger temporal distances between tracks and of integrating holistic deep feature representations with extracted body pose layouts. This in turn improves the possibility of tracking a person efficiently over longer periods and under varied conditions. Person re-ID is discussed further in Section III.C.
C) Person re-identification
Person re-identification (re-ID) with Vision Transformers (ViT) involves finding a specific person across multiple camera views or times, or in the same view at different times. The task is considered difficult because of changes in person pose, viewpoint, detection quality, background, occlusion, and lighting. Subsequent work has addressed these problems by building on what ViT excels at, namely handling spatial misalignment and semantic alignment.
One proposed solution uses the ViT architecture to split the person image into non-overlapping patches, which allows densely semantically aligned features to be learned. This makes semantically aligned model construction and semantically aligned feature learning possible. To tackle issues such as alignment errors and difficulties in handling non-overlapping regions, a learning paradigm can be introduced in which the features are learned with the help of an auxiliary stream.
Alternatively, an encoder-decoder structure can be applied: the encoder employs ViT to extract the re-ID features of the input image, and the decoder synthesizes a 3D full-body texture image in a canonical semantic space. This ensures that the learned features are invariant to view and pose, eliminating the visible body discrepancies that otherwise arise when matching images. The decoder is only employed during training and does not add to the model's complexity at test time. Integrating person re-ID with ViT is promising because it can exploit spatial and hierarchical components of the approach in addition to semantic alignment.
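A minimal sketch of the test-time matching this enables is given below: query and gallery crops are embedded with a placeholder patch-based encoder and ranked by cosine similarity; the encoder, its dimensions, and the pooling rule are illustrative assumptions, not a trained re-ID model.

```python
# Sketch of re-ID matching at test time: embed query and gallery crops with a
# ViT-style encoder (any patch-based backbone works here) and rank the gallery
# by cosine similarity. The encoder below is a placeholder, not a trained model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyViTEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.patch = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.attn = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)

    def forward(self, x):
        tokens = self.patch(x).flatten(2).transpose(1, 2)   # (B, N, D)
        tokens = self.attn(tokens)
        return F.normalize(tokens.mean(dim=1), dim=-1)      # pooled, L2-normalised embedding

encoder = ToyViTEncoder().eval()
query = torch.randn(1, 3, 256, 128)       # one person crop
gallery = torch.randn(10, 3, 256, 128)    # candidate crops from other cameras

with torch.no_grad():
    sims = encoder(query) @ encoder(gallery).T             # cosine similarities, shape (1, 10)
print(sims.argsort(descending=True))      # gallery entries ranked for the query
```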
Because human attention concentrates on the main elements of the content, for temporal analysis we use Vision Transformers (ViT) and recognize co-occurrence patterns of joint features. The patch-based architecture and self-attention mechanism of ViT aid feature extraction and weighting for this computer vision task. To this end, we introduce spatial attention weights over the joints in each frame of the action sequence and a temporal attention weight curve over the frames for the action prediction output.
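The sketch below illustrates this weighting scheme in PyTorch: a spatial attention branch scores the joints within each frame and a temporal attention branch scores the frames within a sequence; the layer choices, joint count, and feature dimensions are illustrative only.

```python
# Sketch of spatial (per-joint) and temporal (per-frame) attention weighting for
# skeleton-based action recognition. Shapes and layers are illustrative only.
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    def __init__(self, num_joints=17, feat_dim=64):
        super().__init__()
        self.joint_score = nn.Linear(feat_dim, 1)   # scores each joint within a frame
        self.frame_score = nn.Linear(feat_dim, 1)   # scores each frame within a sequence

    def forward(self, x):                           # x: (B, T, J, D) joint features
        spatial = torch.softmax(self.joint_score(x), dim=2)            # (B, T, J, 1)
        frame_feat = (spatial * x).sum(dim=2)                          # (B, T, D) weighted per frame
        temporal = torch.softmax(self.frame_score(frame_feat), dim=1)  # (B, T, 1)
        return (temporal * frame_feat).sum(dim=1)                      # (B, D) clip descriptor

feats = torch.randn(2, 30, 17, 64)             # 2 clips, 30 frames, 17 joints, 64-d features
clip_descriptor = SpatialTemporalAttention()(feats)
logits = nn.Linear(64, 10)(clip_descriptor)    # action classification head (10 classes)
print(logits.shape)                            # torch.Size([2, 10])
```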
The main idea of the adaptation is to handle view variation through a view adaptation layer, which takes advantage of ViT's ability to learn hierarchical spatial organization and semantic correspondence. It transforms the input 3D skeleton sequence of an action clip into a consistent view, helping the main action classification network focus on action details regardless of the viewpoint.
Fig. 4. Illustration of a retail intelligence scenario where multiple cameras are deployed, 3D space is
reconstructed, people are detected and tracked, and heat-map (in purple) is generated.
The proposed modifications to ViT, primarily the attention mechanisms and view adaptation, have a clear impact: we obtain up to 6% absolute improvement on a benchmark dataset over the basic ViT, which underlines the value of further developing attention mechanisms for vision tasks. The described approach allows machines to recognize actions despite view variation, similar to human abilities.
Conclusion:
For a better and more comprehensive understanding of videos and video sequences, human-oriented vision tasks must be incorporated. This is where a system perspective is most helpful, as one can effectively combine the advantages of one building block and constructively address possible risks in others. Depending on the application, a practical system may integrate some or all of the following: person detection and tracking, re-identification, pose estimation, action recognition, heat-map generation, and so on.
For instance, in a retail intelligence scenario with several cameras, as depicted in Fig. 4, a customer may need to be tracked or recognized through face detection, body detection, skeleton detection, or a combination of these. Longitudinal linkages are important for joining tracklets of the same person within the same time period and across different cameras, through cohort analysis and person re-ID, respectively. In this way, heat maps can be created while people are tracked, providing more insight into the customer. At this point, the estimated pose sequences, or pose together with RGB data, can be used to identify more detailed activities.
Hence, for efficient fusion, a Vision Transformer (ViT) can be integrated into a multi-task learning approach in which the feature extraction network is shared. This offers a clear path to shared feature extraction and also makes it possible to process different tasks simultaneously.
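A minimal sketch of such sharing is shown below: one backbone computes features once per frame and separate heads serve detection, re-ID, and pose; all module names and dimensions are hypothetical placeholders rather than a production design.

```python
# Sketch of multi-task learning with a shared feature extractor and separate
# task heads (detection, re-ID, pose). All module names are illustrative.
import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(               # shared features, computed once per frame
            nn.Conv2d(3, feat_dim, kernel_size=16, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.detect_head = nn.Linear(feat_dim, 4)    # e.g. a box regressor
        self.reid_head = nn.Linear(feat_dim, 128)    # embedding for re-identification
        self.pose_head = nn.Linear(feat_dim, 17 * 2) # 17 joints, (x, y) each

    def forward(self, frame):
        f = self.backbone(frame)
        return self.detect_head(f), self.reid_head(f), self.pose_head(f)

boxes, reid_emb, pose = SharedBackboneMultiTask()(torch.randn(1, 3, 256, 256))
print(boxes.shape, reid_emb.shape, pose.shape)   # (1, 4) (1, 128) (1, 34)
```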
Special consideration has to be paid to deployment constraints for real-time interactive applications such as video teleconferencing, which are very demanding in terms of model size (which may need to be as low as about 100 KB) and very strict on speed (a few milliseconds per frame). The focus therefore shifts to reducing model size; methods for accelerating ViT-based models include knowledge distillation, model pruning, and quantization. Integrating ViT with the other human-centric vision tasks in this way can give a complete picture of video comprehension and yield better system-level results.
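As one example of these compression techniques, the sketch below shows a standard knowledge-distillation loss in which a small student is trained to match the temperature-softened outputs of a larger teacher; the temperature, weighting, and toy models are illustrative assumptions.

```python
# Sketch of a knowledge-distillation loss: a small student is trained to match the
# softened output distribution of a larger teacher. Temperature and weights are
# illustrative; the models here are placeholders, not ViT implementations.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)   # usual supervised term
    return alpha * soft + (1 - alpha) * hard

teacher = nn.Linear(128, 10).eval()     # stands in for a large ViT
student = nn.Linear(128, 10)            # smaller, deployable model
x, y = torch.randn(8, 128), torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, y)
loss.backward()
```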
FUTURE PERSPECTIVES
We put emphasis on understanding humans in videos because it is critical for video comprehension. Although there have been notable recent advances in human-centric vision tasks, the majority of these advances only address individual component tasks and do not consider the interrelations between entities or the cause-effect relationships between actions and events. This is partly because the amount of labeled data needed to train such complex tasks grows at an exponential rate. Human knowledge is therefore essential and should be incorporated into learning systems to reduce the dependence on purely data-driven approaches for learning such semantic relationships.
Fortunately, some attempts have recently been made to introduce human knowledge into models, for example through graph convolutional networks (GCNs) and Symbolic Graph Reasoning (SGR) layers for enhanced video understanding. Similarly, there is value in using semi-supervised and unsupervised learning technologies, such as Vision Transformers (ViT) pre-trained on unlabelled datasets.
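One common self-supervised objective of this kind is masked-patch reconstruction; the sketch below illustrates the idea on random patch embeddings, with all dimensions, the masking ratio, and the single-layer encoder chosen purely for illustration.

```python
# Sketch of one common self-supervised objective for pre-training a patch-based
# model on unlabelled frames: mask a random subset of patch tokens and train the
# network to reconstruct them (an MAE-style objective; all details illustrative).
import torch
import torch.nn as nn

embed_dim, num_patches, mask_ratio = 128, 196, 0.5
encoder = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
decoder = nn.Linear(embed_dim, embed_dim)           # predicts the original patch embedding
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

patches = torch.randn(4, num_patches, embed_dim)    # unlabelled patch embeddings
mask = torch.rand(4, num_patches) < mask_ratio      # True = hidden from the model
inputs = torch.where(mask.unsqueeze(-1), mask_token.expand_as(patches), patches)

pred = decoder(encoder(inputs))
loss = ((pred - patches) ** 2)[mask].mean()         # reconstruction loss on masked patches only
loss.backward()
```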
While some aspects of human perception, such as attention mechanisms, are already exploited to some degree, a great deal is still unknown about the human brain. Developing a human-centric approach based on a simple idea, namely to understand humans and use this knowledge to improve video understanding technologies, is a move in the right direction.
While research has advanced, practical application of these developments has been slower. Nevertheless, industry leaders and start-ups are bringing technologies to market through various application areas such as retail analytics, smart care for the aging population, and smart security. All of these cases require machine intelligence, and such technologies are expected to mature and promote these uses in the near future.
Vision Transformers (ViT), built on the transformer architecture, bring transformer models to various vision tasks and perform well. To achieve these goals, it is important to incorporate human knowledge in order to augment the advantages of the model while using ViT to enhance the understanding of humans in videos and to realize practical applications.
ACKNOWLEDGEMENT
The author would like to express his gratitude to all his colleagues and interns at Microsoft Research Asia for numerous valuable and inspiring discussions that have influenced this work and the opinions and ideas expressed in this paper. Cuiling Lan and Chong Luo are especially thanked for their consistent interdisciplinary cooperation on the specific technical details described in the paper in relation to the Vision Transformer (ViT) work, and Xiaoyan Sun and Chunyu Wang are thanked for their work in related research areas. Their high-quality and constructive comments and suggestions were valuable in improving the author's understanding of and experience with human-oriented video analysis.
FINANCIAL SUPPORT
To the best of the author's knowledge, this study did not receive any current or planned funding from commercial or not-for-profit sources.
CONFLICT OF INTEREST
None
References:
Li B.; Yan J.; Wu W.; Zhu Z.; Hu X.: High performance visual tracking with Siamese region proposal network, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, 2018.
Wu Y.; Lim J.; Yang M.-H.: Online object tracking: a benchmark, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Portland, 2013, 2411-2418.
Giebel J.; Gavrila D.; Schnorr C.: A Bayesian framework for multi-cue 3D object tracking, in Proc. of the European Conf. on Computer Vision, Prague, 2004.
Fleuret F.; Berclaz J.; Lengagne R.; Fua P.: Multicamera people tracking with a probabilistic occupancy map. IEEE Trans. Pattern Anal. Mach. Intell., 30(2) (2008), 267-282.
Berclaz J.; Fleuret F.; Turetken E.; Fua P.: Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell., 33(9) (2011), 1806-1819.
Ristani E.; Tomasi C.: Tracking multiple people online and in real time, in Asian Conf. on Computer Vision, Singapore, 2014.
Tang S.; Andriluka M.; Milan A.; Schindler K.; Roth S.; Schiele B.: Learning people detectors for tracking in crowded scenes, in Proc. of the IEEE Int. Conf. on Computer Vision, Sydney, 2013.
Tang S.; Andriluka M.; Andres B.; Schiele B.: Multiple people tracking by lifted multicut and person re-identification, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, 2017.
Yang Y.; Ramanan D.: Articulated pose estimation with flexible mixtures-of-parts, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Colorado Springs, 2011.
Chen X.; Yuille A.: Articulated pose estimation by a graphical model with image dependent pairwise relations, in NIPS'14, Montreal, December 2014.
Yang W.; Ouyang W.; Li H.; Wang X.: End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, 2016.
Toshev A.; Szegedy C.: DeepPose: human pose estimation via deep neural networks, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Columbus, 2014.
Wei S.-E.; Ramakrishna V.; Kanade T.; Sheikh Y.: Convolutional pose machines, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, 2016.
Newell A.; Yang K.; Deng J.: Stacked hourglass networks for human pose estimation, in European Conf. on Computer Vision, Amsterdam, 2016.
Cao Z.; Simon T.; Wei S.; Sheikh Y.: Realtime multi-person 2D pose estimation using part affinity fields, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, 2017.
Sun K.; Lan C.; Xing J.; Wang J.; Zeng W.; Liu D.: Human pose estimation using global and local normalization, in Proc. of the IEEE Int. Conf. on Computer Vision, Venice, 2017.
Martinez J.; Hossain R.; Romero J.; Little J.J.: A simple yet effective baseline for 3D human pose estimation, in Proc. of the IEEE Int. Conf. on Computer Vision, Venice, 2017.
Moreno-Noguer F.: 3D human pose estimation from a single image via distance matrix regression, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, 2017.
Sun X.; Xiao B.; Wei F.; Liang S.; Wei Y.: Integral human pose regression, in European Conf. on Computer Vision, Munich, 2018.
Hartley R.; Zisserman A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2003.
Amin S.; Andriluka M.; Rohrbach M.; Schiele B.: Multiview pictorial structures for 3D human pose estimation, in British Machine Vision Conf., Bristol, 2013.
Qiu H.; Wang C.; Wang J.; Wang N.; Zeng W.: Cross view fusion for 3D human pose estimation, in Proc. of the IEEE Int. Conf. on Computer Vision, Seoul, 2019.
Ionescu C.; Papava D.; Olaru V.; Sminchisescu C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell., 36(7) (2014), 1325-1339.
Tome D.; Toso M.; Agapito L.; Russell C.: Rethinking pose in 3D: multi-stage refinement and recovery for markerless motion capture, in Int. Conf. on 3D Vision, Verona, 2018.
Wang X.: Intelligent multi-camera video surveillance: a review. Pattern Recognit. Lett., 34(1) (2013), 3-19.
Varior R.R.; Shuai B.; Lu J.; Xu D.; Wang G.: A Siamese long short-term memory architecture for human re-identification, in European Conf. on Computer Vision, Amsterdam, 2016.
Su C.; Li J.; Zhang S.; Xing J.; Gao W.; Tian Q.: Pose-driven deep convolutional model for person re-identification, in Proc. of the IEEE Int. Conf. on Computer Vision, Venice, 2017.
Suh Y.; Wang J.; Tang S.; Mei T.; Lee K.M.: Part-aligned bilinear representations for person re-identification, in European Conf. on Computer Vision, Munich, 2018.
Cheng D.; Gong Y.; Zhou S.; Wang J.; Zheng N.: Person re-identification by multi-channel parts-based CNN with improved triplet loss function, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, 2016.
Wang G.; Yuan Y.; Chen X.; Li J.; Zhou X.: Learning discriminative features with multiple granularities for person re-identification, in ACM Multimedia, Seoul, 2018.
Li D.; Chen X.; Zhang Z.; Huang K.: Learning deep context-aware features over body and latent parts for person re-identification, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, 2017.
Zhang Z.; Lan C.; Zeng W.; Chen Z.: Densely semantically aligned person re-identification, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Long Beach, June 2019.
Guler R.A.; Neverova N.; Kokkinos I.: DensePose: dense human pose estimation in the wild, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, 2018.
Jin X.; Lan C.; Zeng W.; Wei G.; Chen Z.: Semantics-aligned representation learning for person re-identification, in AAAI Conference on Artificial Intelligence, New York, 2020.
Weinland D.; Ronfard R.; Boyer E.: A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst., 115(2) (2011), 224-241.
Simonyan K.; Zisserman A.: Two-stream convolutional networks for action recognition in videos, in NIPS'14, Montreal, December 2014, 568-576.
Tran D.; Bourdev L.; Fergus R.; Torresani L.; Paluri M.: Learning spatiotemporal features with 3D convolutional networks, in Proc. of the IEEE Int. Conf. on Computer Vision, Santiago, December 2015.
Feichtenhofer C.; Pinz A.; Wildes R.: Spatiotemporal residual networks for video action recognition, in Advances in Neural Information Processing Systems, Barcelona, 2016, 3468-3476.
Wang L. et al.: Temporal segment networks: towards good practices for deep action recognition, in European Conf. on Computer Vision, Amsterdam, 2016, 20-36.
Qiu Z.; Yao T.; Mei T.: Learning spatio-temporal representation with pseudo-3D residual networks, in Proc. of the IEEE Int. Conf. on Computer Vision, Venice, 2017, 5533-5541.
Zhou Y.; Sun X.; Zha Z.; Zeng W.: MiCT: mixed 3D/2D convolutional tube for human action recognition, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, 2018.
Du Y.; Wang W.; Wang L.: Hierarchical recurrent neural network for skeleton based action recognition, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Boston, 2015, 1110-1118.