Abstract:
Human-centric deep video understanding seeks to close the gap between computer vision and human perception by creating AI systems that can interpret and grasp video content in a way similar to that of a human. Deep learning-based methods currently in use concentrate on object detection, action recognition, and scene interpretation, and overlook the complexities of social relationships, human behavior, and emotional intelligence. This restriction prevents them from fully comprehending audiovisual content. Our method has applications in video analytics, social robotics, and human-computer interaction, among other areas. By creating systems that comprehend video information from a human perspective, we can enhance AI systems' capacity to communicate with people, identify social cues, and offer more precise insights into human behavior. This paper proposes an approach to deep video understanding that focuses on the subtleties of social interactions and human behavior, with the aim of developing AI systems that behave more like humans.
Introduction:
Artificial intelligence (AI) is receiving enormous attention, and for good reason. Machines now match or outperform humans in areas such as face and object recognition, IQ tests, games, speech recognition, and written text comprehension and translation. These advances rest on three basic pillars. The first is access to 'big data', such as thousands of hours of transcribed speech and tens of millions of labeled images. The second is the availability of large computational capability, such as GPUs and clusters of cloud servers. The third is algorithmic progress in machine learning, most notably deep learning and reinforcement learning. This is why it is often said that we live in an era of AI.
Deep CNN solutions, originally developed for still-image classification, have been adopted for video classification since 2012. More recently, a new paradigm has emerged: the Vision Transformer, or ViT for short. Like CNNs, ViTs capture long-range context, but they do so through self-attention over sequences of image patches rather than over whole images. This approach has been applied to a range of image classification tasks with promising results, often matching or surpassing standard CNNs.
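To make the patch-based idea concrete, the following is a minimal PyTorch sketch of ViT-style patch embedding followed by one self-attention layer; the image size, patch size, and embedding dimension are illustrative choices, not those of any particular published model.

```python
# Minimal sketch of ViT-style patch embedding (illustrative dimensions, PyTorch).
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into non-overlapping patches and projects each to an embedding."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution is equivalent to flattening each patch and applying a linear layer.
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                       # x: (B, 3, 224, 224)
        x = self.proj(x)                        # (B, 768, 14, 14)
        return x.flatten(2).transpose(1, 2)     # (B, 196, 768): one token per patch

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
encoder = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
tokens = encoder(patches)                       # self-attention over the patch sequence
print(tokens.shape)                             # torch.Size([1, 196, 768])
```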
Because ViTs scale well with more computing resources and, especially, with big data, this model family has become important across computer vision. ViTs are now being extended to other purposes such as object detection and semantic segmentation, while the field as a whole gradually moves toward deploying higher-level computer vision technologies in real-world applications. It is therefore fair to say that the Vision Transformer is one of the leading developments in artificial intelligence, as it provides a firm basis for progress in the vast area of visual information understanding.
Related Work:
Research findings suggest that some 80% to 85% of human perception, learning, cognition, and activity is mediated through vision. This means that developing visual intelligence is also essential for AI. Advances in the application of deep learning have greatly enhanced the assessment and identification of images.
These approaches build on convolutional neural networks (CNNs), which emerged as the dominant approach in 2012 and have since greatly improved performance. In recent years, however, a new model, the Vision Transformer (ViT), has driven a change of direction: images are split into patches and self-attention is used to capture dependencies that extend beyond individual patches. This change of perspective has proved effective across a variety of image classification tasks, where such models often outperform deep CNNs.
The Vision Transformer also scales well with larger datasets and more computing power and has become one of the significant trends in computer vision. Applying ViTs has improved object detection and semantic segmentation, which further increases their reliability. Figure 1 compares the error rates of ViT-based systems with those of previous CNN-based winners of the ImageNet classification challenge.
In 2020, Dosovitskiy et al. introduced the Vision Transformer, which surpasses traditional CNN-based systems on the ImageNet classification task. This suggests that ViTs can interpret images more accurately than human annotators, much as deep residual networks did in 2015. The shift from traditional CNNs to Vision Transformers is another demonstration of the rapid pace at which deep learning methodologies advance.
ViTs have also proved useful for object detection and semantic segmentation, where they have returned good results. Such developments place the field on a trajectory toward state-of-the-art technology built on the applicability of computer vision to various real-world applications, giving rise to a new generation of AI-driven visual comprehension.
Methodology:
Historically, a great leap in techniques for still images is usually followed, within a short time, by similar improvements for moving pictures. Video analytics to support intelligent applications is well accepted in both the enterprise and consumer sectors. Beyond the obvious area of public security, new and upcoming uses include business intelligence, home security, self-driving cars, and narration.

Videos also present considerably more difficulties than still images. Detecting pedestrians in videos, for example, is more challenging than in still images, because videos encompass almost all the content that could conceivably occur. Storage, computation, and communication requirements are much greater, and at times the information must be processed as it arrives. Labeling video data is particularly expensive, and in some situations sufficient training data is simply not available. For instance, when analyzing surveillance video, access to the data is frequently problematic and the number of positive samples may be low. Such challenges push practical video analytics technologies to their limits.
Fig. 1. Performance of the winners of the ImageNet classification competitions over the years.
Results:
Understanding a video depends on identifying the people in the scene, their characteristics, and what they are doing. Much has been achieved in many fundamental vision tasks focused on human-centric applications. An outline of some of these technologies is given below.
A) People Tracking
Visual object tracking is one of the universally important problems in video analysis and understanding. Given the bounding box of a target object in the first frame of a video, the job of a tracker is to locate the target in every subsequent frame. Single object tracking can generally be regarded as a detection problem, because in essence it involves re-detecting a particular object from one frame to the next. The major challenge is that two partly contradicting demands must be met at the same time: on one hand, robustness, the ability to generalize to unfamiliar appearances; on the other hand, discrimination, the ability to distinguish the target from objects already seen. Robustness means that a tracker should not lose the target under variations in illumination, target motion, view angle, or object deformation. Discrimination means that the tracker has to separate the target object from similar objects and from clutter in the background. Both are conventionally managed through online training, which strives to attain this flexibility.
Since 2015, researchers have employed CNNs in tracking. Although deep features slow down online training, they are powerful enough that online training can be eliminated altogether. The first work in this direction is SiamFC, where Siamese CNNs are used to extract features of the target and of the search region, followed by a conventional cross-correlation layer that provides an efficient sliding-window matching mechanism. As a result, SiamFC achieves a working frame rate of 86 fps on a GPU. Many follow-up works, such as SA-Siam and SiamRPN, build on this basic concept.
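The following is a minimal sketch, in PyTorch, of the cross-correlation matching used by SiamFC-style trackers; the toy backbone and crop sizes are placeholders rather than the actual SiamFC architecture.

```python
# Sketch of SiamFC-style cross-correlation between a template and a search region (PyTorch).
# The backbone here is a toy stand-in; SiamFC itself uses an AlexNet-like CNN.
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
)

template = torch.randn(1, 3, 127, 127)   # exemplar crop of the target (first frame)
search   = torch.randn(1, 3, 255, 255)   # larger search region in the current frame

z = backbone(template)                   # template features
x = backbone(search)                     # search-region features

# Using the template features as a correlation kernel slides the target over the
# search region in a single convolution, producing a response (score) map.
response = F.conv2d(x, z)
cy, cx = divmod(response.flatten().argmax().item(), response.shape[-1])
print(response.shape, (cy, cx))          # peak location ~ new target position
```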
Fig. 2. Accuracy-speed trade-off of top-performing trackers on the OTB-100 benchmark. The speed
axis is logarithmic. Reproduced from Fig. 8 of [7]. Please refer to [7] for the notations of different
trackers. “Ours” refers to the SPM-Tracker [7].
Most object tracking work that builds on SiamFC uses a single-stage architecture, but a two-stage SiamFC-based network has been introduced to address robustness and discrimination jointly. In this design, the first, coarse matching stage aims to improve robustness, and the second, fine matching stage increases discrimination power by replacing the cross-correlation layer with a more elaborate distance-learning subnetwork. The results are statistically sound, and the tracker operates at 120 frames per second on an NVIDIA P100 GPU. Some of these comparisons are highlighted in Figure 2, which shows the accuracy-speed trade-off of the best trackers on the OTB-100 benchmark.
In many real-life situations there is a need to track several individuals at once. Solving multiple-person tracking as a collection of independent single-person tracking problems is generally inefficient. Modern multi-person tracking approaches rely on the tracking-by-detection paradigm: a general object/person detector detects objects of the target class (person) in individual frames, and the detections are then connected across frames. Techniques include importance sampling and particle filtering for propagating the state in a Bayesian way, linking short tracks over long time spans with the Hungarian algorithm for optimal assignment, and greedy dynamic programming in which trajectories are instantiated one at a time. To reduce wrong identity assignments, recent studies have posed the task of connecting detections over longer periods using global optimization techniques, as illustrated below. A typical method in multi-person tracking is constrained flow optimization using the k-shortest paths method. Other related approaches include graph-based formulations of the minimum cost multicut problem.
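As a concrete illustration of the assignment step, the sketch below links detections between two frames with the Hungarian algorithm via SciPy; the IoU-based cost and the gating threshold are common but by no means the only choices.

```python
# Minimal sketch: linking detections between consecutive frames with the
# Hungarian algorithm (scipy). Boxes are (x1, y1, x2, y2); cost = 1 - IoU.
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(a, b):
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

tracks = np.array([[10, 10, 50, 90], [100, 40, 150, 120]])      # boxes from frame t
detections = np.array([[12, 14, 52, 94], [98, 38, 149, 118]])   # boxes from frame t+1

cost = np.array([[1.0 - iou(t, d) for d in detections] for t in tracks])
row, col = linear_sum_assignment(cost)          # optimal one-to-one assignment
for r, c in zip(row, col):
    if cost[r, c] < 0.7:                        # gating threshold on the match cost
        print(f"track {r} -> detection {c} (IoU={1 - cost[r, c]:.2f})")
```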
Compared with tracking-by-detection approaches that separate tracking from detection, joint detection and tracking may be more efficient, and more research should be dedicated to joint approaches for multiple-object/people tracking. Better use also needs to be made of spatial and temporal structure, and a balance between complexity and accuracy has to be found. Long-term tracking is even more complex: it often produces intermediate short-term tracklets, and linking or matching these tracklets over time then becomes necessary, for instance by using object re-identification (re-ID) methods.
People tracking has been reshaped by the Vision Transformer (ViT), since its self-attention mechanism can capture both spatial and temporal context. ViTs take sequences of image patches as input, which makes them well suited to analyzing videos, where temporal dependencies are crucial. They scale readily to large amounts of data and compute and provide strong solutions for complex tracking problems. In addition, ViTs can be fine-tuned with smaller amounts of supervised data, which makes them less sensitive to the problems of expensive labeling and limited data samples.
For practical people-tracking systems, ViTs can be designed to cope with occlusion and body deformation. The ViT approach can also be extended to more sophisticated re-ID architectures, especially given the possibility of matching hypotheses over larger temporal distances between tracks and of integrating holistic deep feature representations with extracted body pose layouts. This in turn improves the possibility of tracking a person efficiently over longer periods and under varied conditions. Person re-ID is discussed further in Section III.C.
C) Person re-identification
Person re-identification (re-ID) with Vision Transformers (ViT) involves finding a specific person across multiple camera views or times, or in the same view at different times. The task is considered difficult because of changes in person pose, viewpoint, detection quality, background, occlusion, and lighting. Subsequent work has addressed these problems by building on what ViT excels at, namely handling spatial misalignment and semantic alignment.
One proposed solution uses the ViT architecture to split the person image into non-overlapping patches, which allows densely semantically aligned features to be learned. This makes semantically aligned model construction and semantically aligned feature learning possible. To tackle issues such as alignment errors and difficulties in handling non-overlapping regions, a learning paradigm can be introduced in which the features are learned with the help of an auxiliary stream.
Alternatively, an encoder-decoder structure can be applied: the encoder employs ViT to extract the re-ID features of the input image, and the decoder synthesizes a 3D full-body texture image in a canonical semantic space. This ensures that the learned features are invariant to view and pose, eliminating the visible body discrepancies that otherwise arise when matching images. The decoder is only employed during training and does not add to the model's complexity at test time. Integrating person re-ID with ViT is promising because it can exploit spatial and hierarchical components of the approach in addition to semantic alignment.
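A minimal sketch of the test-time matching this enables is given below: query and gallery crops are embedded with a placeholder patch-based encoder and ranked by cosine similarity; the encoder, its dimensions, and the pooling rule are illustrative assumptions, not a trained re-ID model.

```python
# Sketch of re-ID matching at test time: embed query and gallery crops with a
# ViT-style encoder (any patch-based backbone works here) and rank the gallery
# by cosine similarity. The encoder below is a placeholder, not a trained model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyViTEncoder(nn.Module):
    def __init__(self, embed_dim=256):
        super().__init__()
        self.patch = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        self.attn = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)

    def forward(self, x):
        tokens = self.patch(x).flatten(2).transpose(1, 2)   # (B, N, D)
        tokens = self.attn(tokens)
        return F.normalize(tokens.mean(dim=1), dim=-1)      # pooled, L2-normalised embedding

encoder = ToyViTEncoder().eval()
query = torch.randn(1, 3, 256, 128)       # one person crop
gallery = torch.randn(10, 3, 256, 128)    # candidate crops from other cameras

with torch.no_grad():
    sims = encoder(query) @ encoder(gallery).T             # cosine similarities, shape (1, 10)
print(sims.argsort(descending=True))      # gallery entries ranked for the query
```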
Because human attention concentrates on the main elements of the content, for temporal analysis we use Vision Transformers (ViT) and recognize co-occurrence patterns of joint features. The patch-based architecture and self-attention mechanism of ViT aid feature extraction and weighting for this computer vision task. To this end, we introduce spatial attention weights over the joints in each frame of the action sequence and a temporal attention weight curve over the frames for the action prediction output.
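The sketch below illustrates this weighting scheme in PyTorch: a spatial attention branch scores the joints within each frame and a temporal attention branch scores the frames within a sequence; the layer choices, joint count, and feature dimensions are illustrative only.

```python
# Sketch of spatial (per-joint) and temporal (per-frame) attention weighting for
# skeleton-based action recognition. Shapes and layers are illustrative only.
import torch
import torch.nn as nn

class SpatialTemporalAttention(nn.Module):
    def __init__(self, num_joints=17, feat_dim=64):
        super().__init__()
        self.joint_score = nn.Linear(feat_dim, 1)   # scores each joint within a frame
        self.frame_score = nn.Linear(feat_dim, 1)   # scores each frame within a sequence

    def forward(self, x):                           # x: (B, T, J, D) joint features
        spatial = torch.softmax(self.joint_score(x), dim=2)            # (B, T, J, 1)
        frame_feat = (spatial * x).sum(dim=2)                          # (B, T, D) weighted per frame
        temporal = torch.softmax(self.frame_score(frame_feat), dim=1)  # (B, T, 1)
        return (temporal * frame_feat).sum(dim=1)                      # (B, D) clip descriptor

feats = torch.randn(2, 30, 17, 64)             # 2 clips, 30 frames, 17 joints, 64-d features
clip_descriptor = SpatialTemporalAttention()(feats)
logits = nn.Linear(64, 10)(clip_descriptor)    # action classification head (10 classes)
print(logits.shape)                            # torch.Size([2, 10])
```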
The main idea of the adaptation is to handle view variation through a view adaptation layer, which takes advantage of ViT's ability to learn hierarchical spatial organization and semantic correspondence. It transforms the input 3D skeleton sequence of an action clip into a consistent view, helping the main action classification network focus on action details regardless of the viewpoint.
Fig. 4. Illustration of a retail intelligence scenario where multiple cameras are deployed, 3D space is
reconstructed, people are detected and tracked, and heat-map (in purple) is generated.
The proposed modifications to ViT, primarily the attention mechanisms and view adaptation, have a clear impact: we obtain up to 6% absolute improvement on a benchmark dataset over the basic ViT, which underlines the value of further developing attention mechanisms for vision tasks. The described approach allows machines to recognize actions despite view variation, similar to human abilities.
Conclusion:
For a better and more comprehensive understanding of videos and video sequences, human-oriented vision tasks must be incorporated. This is where a system perspective is most helpful, as one can effectively combine the advantages of one building block and constructively address possible risks in others. Depending on the application, a practical system may integrate some or all of the following: person detection and tracking, re-identification, pose estimation, action recognition, heat-map generation, and so on.
For instance, in a retail intelligence scenario with several cameras, as depicted in Fig. 4, a customer may need to be tracked or recognized through face detection, body detection, skeleton detection, or a combination of these. Longitudinal linkages are important for joining tracklets of the same person within the same time period and across different cameras, through cohort analysis and person re-ID, respectively. In this way, heat maps can be created while people are tracked, providing more insight into the customer. At this point, the estimated pose sequences, or pose together with RGB data, can be used to identify more detailed activities.
Hence, for efficient fusion, a Vision Transformer (ViT) can be integrated into a multi-task learning approach in which the feature extraction network is shared. This offers a clear path to shared feature extraction and also makes it possible to process different tasks simultaneously.
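A minimal sketch of such sharing is shown below: one backbone computes features once per frame and separate heads serve detection, re-ID, and pose; all module names and dimensions are hypothetical placeholders rather than a production design.

```python
# Sketch of multi-task learning with a shared feature extractor and separate
# task heads (detection, re-ID, pose). All module names are illustrative.
import torch
import torch.nn as nn

class SharedBackboneMultiTask(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(               # shared features, computed once per frame
            nn.Conv2d(3, feat_dim, kernel_size=16, stride=16), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.detect_head = nn.Linear(feat_dim, 4)    # e.g. a box regressor
        self.reid_head = nn.Linear(feat_dim, 128)    # embedding for re-identification
        self.pose_head = nn.Linear(feat_dim, 17 * 2) # 17 joints, (x, y) each

    def forward(self, frame):
        f = self.backbone(frame)
        return self.detect_head(f), self.reid_head(f), self.pose_head(f)

boxes, reid_emb, pose = SharedBackboneMultiTask()(torch.randn(1, 3, 256, 256))
print(boxes.shape, reid_emb.shape, pose.shape)   # (1, 4) (1, 128) (1, 34)
```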
Special consideration has to be paid to deployment constraints for real-time interactive applications such as video teleconferencing, which are very demanding in terms of model size (which may need to be as low as about 100 KB) and very strict on speed (a few milliseconds per frame). The focus therefore shifts to reducing model size; methods for accelerating ViT-based models include knowledge distillation, model pruning, and quantization. Integrating ViT with the other human-centric vision tasks in this way can give a complete picture of video comprehension and yield better system-level results.
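As one example of these compression techniques, the sketch below shows a standard knowledge-distillation loss in which a small student is trained to match the temperature-softened outputs of a larger teacher; the temperature, weighting, and toy models are illustrative assumptions.

```python
# Sketch of a knowledge-distillation loss: a small student is trained to match the
# softened output distribution of a larger teacher. Temperature and weights are
# illustrative; the models here are placeholders, not ViT implementations.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)   # usual supervised term
    return alpha * soft + (1 - alpha) * hard

teacher = nn.Linear(128, 10).eval()     # stands in for a large ViT
student = nn.Linear(128, 10)            # smaller, deployable model
x, y = torch.randn(8, 128), torch.randint(0, 10, (8,))
with torch.no_grad():
    t_logits = teacher(x)
loss = distillation_loss(student(x), t_logits, y)
loss.backward()
```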
FUTURE PERSPECTIVES
We put emphasis on understanding humans in videos because it is critical for video comprehension. Although there have been notable recent advances in human-centric vision tasks, the majority of these advances only address individual component tasks and do not consider the interrelations between entities or the cause-effect relationships between actions and events. This is partly because the amount of labeled data needed to train such complex tasks grows at an exponential rate. Human knowledge is therefore essential and should be incorporated into learning systems to reduce the dependence on purely data-driven approaches for learning such semantic relationships.
Fortunately, some attempts have recently been made to introduce human knowledge into models, for example through graph convolutional networks (GCNs) and Symbolic Graph Reasoning (SGR) layers for enhanced video understanding. Similarly, there is value in using semi-supervised and unsupervised learning technologies, such as Vision Transformers (ViT) pre-trained on unlabelled datasets.
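One common self-supervised objective of this kind is masked-patch reconstruction; the sketch below illustrates the idea on random patch embeddings, with all dimensions, the masking ratio, and the single-layer encoder chosen purely for illustration.

```python
# Sketch of one common self-supervised objective for pre-training a patch-based
# model on unlabelled frames: mask a random subset of patch tokens and train the
# network to reconstruct them (an MAE-style objective; all details illustrative).
import torch
import torch.nn as nn

embed_dim, num_patches, mask_ratio = 128, 196, 0.5
encoder = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
decoder = nn.Linear(embed_dim, embed_dim)           # predicts the original patch embedding
mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

patches = torch.randn(4, num_patches, embed_dim)    # unlabelled patch embeddings
mask = torch.rand(4, num_patches) < mask_ratio      # True = hidden from the model
inputs = torch.where(mask.unsqueeze(-1), mask_token.expand_as(patches), patches)

pred = decoder(encoder(inputs))
loss = ((pred - patches) ** 2)[mask].mean()         # reconstruction loss on masked patches only
loss.backward()
```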
While some aspects of human perception, such as attention mechanisms, are already exploited to some degree, a great deal is still unknown about the human brain. Developing a human-centric approach based on a simple idea, namely to understand humans and use this knowledge to improve video understanding technologies, is a move in the right direction.
While research has advanced, practical application of these developments has been slower. Nevertheless, industry leaders and start-ups are bringing technologies to market through various application areas such as retail analytics, smart care for the aging population, and smart security. All of these cases require machine intelligence, and such technologies are expected to mature and promote these uses in the near future.
Vision Transformers (ViT), built on the transformer architecture, bring transformer models to various vision tasks and perform well. To achieve these goals, it is important to incorporate human knowledge in order to augment the advantages of the model while using ViT to enhance the understanding of humans in videos and to realize practical applications.
ACKNOWLEDGEMENT
The author would like to express his gratitude to all his colleagues and interns at Microsoft Research Asia for numerous valuable and inspiring discussions that have influenced this work and the opinions and ideas expressed in this paper. Cuiling Lan and Chong Luo are especially thanked for their consistent interdisciplinary cooperation on the specific technical details described in the paper in relation to the Vision Transformer (ViT) work, and Xiaoyan Sun and Chunyu Wang are thanked for their work in related research areas. Their high-quality and constructive comments and suggestions were valuable in improving the author's understanding of and experience with human-oriented video analysis.
FINANCIAL SUPPORT
To the best of the author's knowledge, this study did not receive any current or planned funding from commercial or not-for-profit sources.
CONFLICT OF INTEREST
None
References:
Li B.; Yan J.; Wu W.; Zhu Z.; Hu X.: High performance visual tracking with Siamese region proposal network, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, 2018.
Wu Y.; Lim J.; Yang M.-H.: Online object tracking: a benchmark, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Portland, 2013, 2411-2418.
Giebel J.; Gavrila D.; Schnorr C.: A Bayesian framework for multi-cue 3D object tracking, in Proc. of the European Conf. on Computer Vision, Prague, 2004.
Fleuret F.; Berclaz J.; Lengagne R.; Fua P.: Multicamera people tracking with a probabilistic occupancy map. IEEE Trans. Pattern Anal. Mach. Intell., 30(2) (2008), 267-282.
Berclaz J.; Fleuret F.; Turetken E.; Fua P.: Multiple object tracking using k-shortest paths optimization. IEEE Trans. Pattern Anal. Mach. Intell., 33(9) (2011), 1806-1819.
Ristani E.; Tomasi C.: Tracking multiple people online and in real time, in Asian Conf. on Computer Vision, Singapore, 2014.
Tang S.; Andriluka M.; Milan A.; Schindler K.; Roth S.; Schiele B.: Learning people detectors for tracking in crowded scenes, in Proc. of the IEEE Int. Conf. on Computer Vision, Sydney, 2013.
Tang S.; Andriluka M.; Andres B.; Schiele B.: Multiple people tracking by lifted multicut and person re-identification, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, 2017.
Yang Y.; Ramanan D.: Articulated pose estimation with flexible mixtures-of-parts, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Colorado Springs, 2011.
Chen X.; Yuille A.: Articulated pose estimation by a graphical model with image dependent pairwise relations, in NIPS'14, Montreal, December 2014.
Yang W.; Ouyang W.; Li H.; Wang X.: End-to-end learning of deformable mixture of parts and deep convolutional neural networks for human pose estimation, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, 2016.
Toshev A.; Szegedy C.: DeepPose: human pose estimation via deep neural networks, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Columbus, 2014.
Wei S.-E.; Ramakrishna V.; Kanade T.; Sheikh Y.: Convolutional pose machines, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, 2016.
Newell A.; Yang K.; Deng J.: Stacked hourglass networks for human pose estimation, in European Conf. on Computer Vision, Amsterdam, 2016.
Cao Z.; Simon T.; Wei S.; Sheikh Y.: Realtime multi-person 2D pose estimation using part affinity fields, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, 2017.
Sun K.; Lan C.; Xing J.; Wang J.; Zeng W.; Liu D.: Human pose estimation using global and local normalization, in Proc. of the IEEE Int. Conf. on Computer Vision, Venice, 2017.
Martinez J.; Hossain R.; Romero J.; Little J.J.: A simple yet effective baseline for 3D human pose estimation, in Proc. of the IEEE Int. Conf. on Computer Vision, Venice, 2017.
Moreno-Noguer F.: 3D human pose estimation from a single image via distance matrix regression, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, 2017.
Sun X.; Xiao B.; Wei F.; Liang S.; Wei Y.: Integral human pose regression, in European Conf. on Computer Vision, Munich, 2018.
Hartley R.; Zisserman A.: Multiple View Geometry in Computer Vision. Cambridge University Press, Cambridge, 2003.
Amin S.; Andriluka M.; Rohrbach M.; Schiele B.: Multiview pictorial structures for 3D human pose estimation, in British Machine Vision Conf., Bristol, 2013.
Qiu H.; Wang C.; Wang J.; Wang N.; Zeng W.: Cross view fusion for 3D human pose estimation, in Proc. of the IEEE Int. Conf. on Computer Vision, Seoul, 2019.
Ionescu C.; Papava D.; Olaru V.; Sminchisescu C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell., 36(7) (2014), 1325-1339.
Tome D.; Toso M.; Agapito L.; Russell C.: Rethinking pose in 3D: multi-stage refinement and recovery for markerless motion capture, in Int. Conf. on 3D Vision, Verona, 2018.
Wang X.: Intelligent multi-camera video surveillance: a review. Pattern Recognit. Lett., 34(1) (2013), 3-19.
Varior R.R.; Shuai B.; Lu J.; Xu D.; Wang G.: A Siamese long short-term memory architecture for human re-identification, in European Conf. on Computer Vision, Amsterdam, 2016.
Su C.; Li J.; Zhang S.; Xing J.; Gao W.; Tian Q.: Pose-driven deep convolutional model for person re-identification, in Proc. of the IEEE Int. Conf. on Computer Vision, Venice, 2017.
Suh Y.; Wang J.; Tang S.; Mei T.; Lee K.M.: Part-aligned bilinear representations for person re-identification, in European Conf. on Computer Vision, Munich, 2018.
Cheng D.; Gong Y.; Zhou S.; Wang J.; Zheng N.: Person re-identification by multi-channel parts-based CNN with improved triplet loss function, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Las Vegas, 2016.
Wang G.; Yuan Y.; Chen X.; Li J.; Zhou X.: Learning discriminative features with multiple granularities for person re-identification, in ACM Multimedia, Seoul, 2018.
Li D.; Chen X.; Zhang Z.; Huang K.: Learning deep context-aware features over body and latent parts for person re-identification, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Honolulu, 2017.
Zhang Z.; Lan C.; Zeng W.; Chen Z.: Densely semantically aligned person re-identification, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Long Beach, June 2019.
Guler R.A.; Neverova N.; Kokkinos I.: DensePose: dense human pose estimation in the wild, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, 2018.
Jin X.; Lan C.; Zeng W.; Wei G.; Chen Z.: Semantics-aligned representation learning for person re-identification, in AAAI Conference on Artificial Intelligence, New York, 2020.
Weinland D.; Ronfard R.; Boyer E.: A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst., 115(2) (2011), 224-241.
Simonyan K.; Zisserman A.: Two-stream convolutional networks for action recognition in videos, in NIPS'14, Montreal, December 2014, 568-576.
Tran D.; Bourdev L.; Fergus R.; Torresani L.; Paluri M.: Learning spatiotemporal features with 3D convolutional networks, in Proc. of the IEEE Int. Conf. on Computer Vision, Santiago, December 2015.
Feichtenhofer C.; Pinz A.; Wildes R.: Spatiotemporal residual networks for video action recognition, in Advances in Neural Information Processing Systems, Barcelona, 2016, 3468-3476.
Wang L. et al.: Temporal segment networks: towards good practices for deep action recognition, in European Conf. on Computer Vision, Amsterdam, 2016, 20-36.
Qiu Z.; Yao T.; Mei T.: Learning spatio-temporal representation with pseudo-3D residual networks, in Proc. of the IEEE Int. Conf. on Computer Vision, Venice, 2017, 5533-5541.
Zhou Y.; Sun X.; Zha Z.; Zeng W.: MiCT: mixed 3D/2D convolutional tube for human action recognition, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Salt Lake City, 2018.
Du Y.; Wang W.; Wang L.: Hierarchical recurrent neural network for skeleton based action recognition, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, Boston, 2015, 1110-1118.