Recent Advances in Deep Learning Based Computer Vision
Dongjian Ma*
Leicester International Institute
Dalian University of Technology
Panjin, China
*Corresponding author: [email protected]
Abstract—Deep learning is a hot topic in current AI research, and significant progress has been made in a variety of fields, including computer vision and natural language processing. Traditional deep learning architectures are normally based on convolutional neural networks. In recent years, impressive results from Transformer models on natural language tasks have led the vision community to study their application to computer vision problems. As a result, neural networks with Transformer architectures have been introduced to computer vision tasks and now challenge the dominance of traditional convolutional neural networks. This paper reviews recent improvements in computer vision based on deep learning methods. It first goes over some of the most important research on deep learning models. It then focuses on advances in four key computer vision tasks: image classification, object detection, semantic segmentation, and human pose estimation. Finally, it summarizes recent computer vision advances and discusses future possibilities.
Keywords—deep learning, computer vision, image classification, semantic segmentation, object detection
I. INTRODUCTION
Deep learning is a popular subfield of machine learning. Deep neural networks, or artificial neural networks, try to emulate the human brain by combining input data, weights, and biases. These parts work together to recognize, classify, and characterize features in the input accurately. Deep learning models typically use hierarchical architectures to connect their layers. A model consists of many layers of interconnected nodes, each of which refines and optimizes the prediction or categorization by building on the previous layer. Each connection is assigned a weight. For each unit, the inputs are multiplied by the weights and summed. The total is then transformed by an activation function, which in most cases is the sigmoid function, the hyperbolic tangent, or the rectified linear unit (ReLU). This sequence of calculations is referred to as forward propagation. Only the input and output layers of a deep neural network are visible. The model starts by receiving data in the input layer, and the final prediction or classification is produced in the output layer. Back propagation is the complementary process used in training: it calculates the prediction errors and then updates the weights and biases of each connection by traveling backwards through the layers. Forward and back propagation work together to allow a neural network to make predictions and correct its errors. After iterative training and parameter updates, the accuracy improves.
Deep learning grew out of the study of artificial neural networks, which began in the 1940s. McCulloch et al. [1] proposed the MCP model, a simplified and abstract model based on the structure and function of nerve cells. Then in 1958, the perceptron algorithm was proposed by Rosenblatt et al. [2]. The perceptron employs the MCP model to categorize multidimensional input data and uses gradient descent to learn updated weights automatically from the training samples. The first wave came to a halt in 1969, when Marvin Minsky demonstrated in his work that the neural networks of the time were simply linear models, incapable of expressing simple logic such as XOR. Nearly twenty years later, the Boltzmann machine was proposed by Hinton et al. [3] using the simulated annealing process. A back-propagation algorithm applicable to the multilayer perceptron (MLP) was developed, with the sigmoid function used for nonlinear mapping. This effectively solved the problem of nonlinear classification and learning and led to a second wave of neural networks. In the 1990s, various machine learning methods, such as the support vector machine, were proposed one after another. These methods are based on statistics and are entirely different from neural networks, so artificial neural networks reached a new low. Interest in artificial neural networks was reawakened after Hinton et al. proposed a solution to the vanishing gradient problem in deep network training: initializing the weights with unsupervised pre-training and then fine-tuning with supervised training. Then in 2012, AlexNet took first place in the ImageNet Challenge. AlexNet's performance demonstrated the potential and utility of deep learning once again, and it is AlexNet's success that reignited interest in deep learning. In 2016, a match between AlphaGo and Lee Sedol aroused the public's interest in deep learning.
Deep learning is probably the most important advancement in computer science in recent years. It has had an impact on practically every scientific subject. Businesses and industries are already being disrupted and transformed by it, and the world's leading economies and technology companies are pushing it further. Deep learning has already surpassed human skill and performance in a number of domains. Take the ImageNet Challenge (Russakovsky et al., 2015), also known as ILSVRC, as an example to better understand the success of deep learning. The goal of the classification challenge is to categorize images based on input pixels using a training dataset of 1.2 million color images divided into 1000 categories. The performance of a classifier is then assessed on a test dataset of 100,000 photos, with the top-5 accuracy being reported.
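The top-5 accuracy mentioned above counts a prediction as correct when the true label is among the five classes that the classifier scores highest. A minimal sketch in plain Python is given here for illustration; the function name `top_k_accuracy` and the toy scores are assumptions made for this example and are not part of the ILSVRC evaluation tooling.

```python
def top_k_accuracy(scores, labels, k=5):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: list of per-sample lists, one score per class (hypothetical model output)
    labels: list of true class indices, one per sample
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score, truncated to k entries.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy example with 4 classes and 3 samples (values chosen only for illustration).
toy_scores = [
    [0.10, 0.20, 0.30, 0.40],
    [0.40, 0.30, 0.20, 0.10],
    [0.26, 0.24, 0.30, 0.20],
]
toy_labels = [3, 3, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
```

With `k=1` only the first sample counts as correct, while `k=2` also accepts the third sample because its true class has the second-highest score. This is why top-5 accuracy is always at least as high as top-1 accuracy on the same predictions.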
Deep learning started relatively late in China but has developed very rapidly. Colleges, universities, research institutes, and companies have made remarkable progress. Pattern recognition, classification, clustering, dimensionality reduction, computer vision, natural language processing (NLP), regression, predictive analysis, and other problems can all be addressed with deep neural networks.
Deep learning has fueled great strides in a variety of computer vision problems, such as object detection, motion tracking, action recognition, human pose estimation, and semantic segmentation. In this paper, we review the main breakthroughs in deep learning architectures and methods for computer vision applications. We then go over some of the significant advances that deep learning approaches have made in computer vision, including image classification, object detection, semantic segmentation, and human pose estimation.
The overview is intended to help computer vision and multimedia analysis researchers, as well as general machine learning researchers, understand the current state of computer vision and anticipate future trends.
II. DEEP LEARNING METHODS DEVELOPMENT IN COMPUTER VISION
Computer vision was one of the first fields to converge on a single family of models. Convolutional Neural Networks (CNNs) handled nearly every task in computer vision, from image classification and object detection to image segmentation and image enhancement, before Google's “Attention Is All You Need” paper appeared in 2017 [4]. The paper proposed a new architecture called Transformer, which has had great success in natural language processing (NLP). Ground-breaking work has since been done on extending Transformer-like structures to the computer vision (CV) domain, with promising results on a variety of CV problems. In this section, we go through some of the most popular deep learning models from the past as well as new approaches that have emerged in recent years.
A. AlexNet
To demonstrate the potential of deep learning, Hinton's group first entered the ImageNet image recognition competition in 2012, where it won the title with the CNN AlexNet [5]. It was also because of this competition that CNNs attracted the attention of many researchers. AlexNet consists of five convolutional layers and three fully connected layers. It uses the ReLU activation function to speed up convergence and mitigate the vanishing gradient problem. Since ReLU does a good job of suppressing vanishing gradients, AlexNet dropped the “pre-training + fine-tuning” approach in favor of fully supervised training. This is why purely supervised training became the mainstream approach in deep learning.
B. VGGNet
VGGNet was proposed by Simonyan and Zisserman [6]. It replaces larger convolutional filters with stacks of small (3x3) convolutional filters and adds more convolutional layers while keeping the other parameters constant. On the one hand, the number of parameters is reduced; on the other, more nonlinear mappings are applied, which can increase the fitting ability. VGG has two variants, VGG16 and VGG19, which are essentially the same except for the depth of the network. VGGNet demonstrated that increasing the depth of a deep neural network can improve its final performance to some extent.
C. ResNet
He et al. [7] observed that one problem of deep neural networks is that accuracy saturates and even declines as network depth increases. To solve this problem, they proposed learning a residual function with reference to the layer inputs. The resulting ResNet builds on the VGG19 network and adds residual units through shortcut connections. Multiple layers in a residual unit are used to learn the residual between input and output. As the number of direct connections grows, the vanishing gradient problem is alleviated, feature propagation is strengthened, feature reuse is encouraged, and the number of parameters is reduced significantly.
D. EfficientNet
Tan and Le [8] demonstrated that balancing network depth, width, and resolution can improve the performance of deep neural networks, and proposed EfficientNet. EfficientNet introduces a new model scaling method which uses a simple and efficient compound coefficient to scale up CNNs in a
Authorized licensed use limited to: ULAKBIM UASL - Osmangazi Universitesi. Downloaded on December 08,2023 at 15:39:31 UTC from IEEE Xplore. Restrictions apply.
more structured way. Unlike traditional approaches, which scale network dimensions such as width, depth, and resolution arbitrarily, this method scales all of them uniformly using a fixed set of scaling coefficients. As a result, EfficientNet-B7 achieved state-of-the-art ImageNet accuracy at the time of publication while being smaller and faster than previous CNNs.
E. Transformer
A recurrent neural network (RNN) is a kind of artificial neural network that processes data sequentially. Traditional RNN computations can only proceed step by step from one end of the sequence to the other. Two issues arise as a result. First, the model's capacity for parallel computation is limited, since the calculation at time t depends on the calculation at time t-1. Second, information is lost along the sequence, so long-range dependencies are hard to capture. Transformer addresses both issues. It uses the attention mechanism to reduce the distance between any two positions in the sequence to a constant, and, unlike an RNN, it is not a sequential structure, so it offers more parallelism.
As an attention-based structure, Transformer has demonstrated great success in sequence modeling and machine translation applications. Some CNN-based models attempted to capture long-range dependencies by adding self-attention layers. Meanwhile, some researchers tried to use a full Transformer architecture and abandon the traditional CNN. Attention-based models have received a lot of interest in image recognition, following Transformer's success in natural language processing. A number of projects have recently applied Transformer to CV tasks with similarly strong results. Dosovitskiy et al. [9], for example, propose a pure Transformer that uses image patches as the input for image classification and has attained state-of-the-art results on a number of image classification benchmarks. Visual Transformers have also shown excellent results on other CV tasks, such as detection, segmentation, tracking, image generation, and image enhancement.
There are two major challenges in applying Transformer to computer vision. First, visual entities vary greatly in scale, so a visual Transformer may not perform well across different scenarios. Second, image resolution is high and pixels are numerous, so Transformer's global self-attention leads to a large amount of computation. Ze Liu et al. [10] proposed Swin Transformer to address these problems. Swin Transformer's improvement is to introduce a shifted-window operation and a hierarchical design. This brings in the locality of the CNN convolution operation and saves computation.
III. ADVANCES IN COMPUTER VISION
In this part, we take a look at some of the most recent developments in the field of computer vision. These works set new records on most mainstream benchmarks. We follow the Papers with Code website to track them.
A. Image Classification
Image classification is the most basic task in computer vision: classifying images into specific semantic categories. Convolutional neural networks have made a splash in this area. Since the birth of AlexNet in 2012, CNNs have usually been the best-performing models for image classification. CNNs are generally considered to have strong feature extraction capabilities and are therefore used as backbones, and models that perform well on classification tasks tend to do well on downstream tasks. In recent years, with the success of Transformer in NLP and in some computer vision areas, more and more research has tried to introduce Transformer into the image classification task.
The first Transformer backbone for image classification, the Vision Transformer (ViT), was proposed by Dosovitskiy et al. [9]. The image is divided into non-overlapping fixed-size patches, and each patch is flattened into a one-dimensional vector. Because these vectors are high-dimensional, the flattened patch sequence is compressed through a linear projection (nn.Linear), which also performs a feature transformation. To facilitate classification, the authors also introduced a learnable class token, which is inserted at the start of the sequence derived from the image tokenization. This sequence, together with a learnable positional encoding, is sent into N serial Transformer encoders for global attention computation and feature extraction. Inside each encoder, a multi-head self-attention module extracts features across patches, and a feed-forward module then transforms the features of each patch. By pre-training on a huge private dataset (JFT-300M, which contains 300 million images), ViT achieves excellent results on a variety of image recognition benchmarks (e.g., ImageNet, CIFAR-10, and CIFAR-100) that are comparable to classic CNN approaches. ViT has demonstrated Transformer's usefulness in CV tasks.
Mitchell Wortsman and his colleagues [11] proposed that averaging the weights of numerous fine-tuned models with distinct hyperparameter settings enhances accuracy and robustness. They call the result a “model soup” because, unlike a typical ensemble, it averages numerous models without incurring extra inference or memory costs. When fine-tuning massive pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, their soup method produced the best-performing model on ImageNet at the time. Figure 2 shows the model soup accuracy.
Figure 2. Model soup accuracy [11]
B. Object Detection
Detecting instances of objects of a specific class inside an image is known as object detection. State-of-the-art approaches fall into two primary categories: one-stage and two-stage methods. Traditional one-stage algorithms, which emphasize inference speed, include YOLO, SSD, and RetinaNet. Two-stage algorithms, which concentrate on detection accuracy, include Faster R-CNN, Mask R-CNN, and Cascade R-CNN. In recent years, Transformer-based approaches have gained in popularity and efficacy.
DETR [12] is groundbreaking work that was proposed by Facebook AI in 2020 and introduced Transformer to detection. DETR treats object detection as a set prediction problem and proposes a very succinct detection pipeline: a CNN backbone extracts features, a Transformer models the relations among them through the attention mechanism, and the loss is calculated by matching the outputs against the ground truth of the image with a bipartite graph matching algorithm. Although the DETR series has achieved great performance improvements, two drawbacks remain. First, existing DETR models are still inferior to traditional detection models in performance. Second, the scalability of established DETR-series models has not been fully explored.
Hao Zhang et al. [13] propose DINO, a new end-to-end DETR-like object detector that incorporates several novel strategies, namely contrastive denoising (DN) training, mixed query selection, and look forward twice, each applied to a different part of the model.
They undertake extensive ablation studies to verify the efficacy of the various design options. With ResNet-50 and multi-scale features, DINO achieves 48.3 Average Precision (AP) in 12 epochs and 51.0 AP in 36 epochs, considerably exceeding the previous best DETR-like models. DINO trained for 12 epochs, in particular, shows a greater improvement on small objects, with a gain of +7.4 AP. Figure 3 shows the training convergence curves evaluated on COCO val2017 for DINO and two previous state-of-the-art models with ResNet-50 using multi-scale features.
Figure 3. Training convergence curves evaluated on COCO val2017 for DINO and two previous state-of-the-art models with ResNet-50 using multi-scale features [13]
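The bipartite matching step in the DETR pipeline described above can be illustrated with a small sketch. It pairs each ground-truth object with a distinct prediction so that the total matching cost is minimal. Several things here are assumptions made for illustration only: the cost is a simplified combination of negative class probability and L1 box distance, the function names `match_cost` and `bipartite_match` are hypothetical, and the exhaustive search over permutations is a stand-in for the Hungarian algorithm that is practical only for a handful of objects.

```python
from itertools import permutations


def match_cost(pred, target):
    """Simplified cost of pairing one prediction with one ground-truth object.

    pred:   (class_probs, box) with box = (cx, cy, w, h)
    target: (class_id, box)
    The cost terms and their equal weighting are illustrative assumptions.
    """
    probs, pred_box = pred
    class_id, target_box = target
    class_cost = -probs[class_id]  # higher probability for the true class = lower cost
    box_cost = sum(abs(p - t) for p, t in zip(pred_box, target_box))  # L1 distance
    return class_cost + box_cost


def bipartite_match(preds, targets):
    """Brute-force minimum-cost one-to-one matching.

    Returns (total_cost, assignment), where assignment[j] is the index of the
    prediction matched to target j. Unmatched predictions are left over
    (they would be treated as "no object").
    """
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(preds)), len(targets)):
        cost = sum(match_cost(preds[i], targets[j]) for j, i in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_cost, best_assignment


# Toy example: three predictions, two ground-truth objects.
toy_preds = [
    ([0.1, 0.9], (0.5, 0.5, 0.2, 0.2)),
    ([0.8, 0.2], (0.2, 0.3, 0.1, 0.1)),
    ([0.5, 0.5], (0.9, 0.9, 0.1, 0.1)),
]
toy_targets = [
    (0, (0.2, 0.3, 0.1, 0.1)),
    (1, (0.5, 0.5, 0.2, 0.2)),
]
toy_cost, toy_assignment = bipartite_match(toy_preds, toy_targets)
```

In this toy case the first target is matched to the second prediction and the second target to the first, and the third prediction is left unmatched. Because every ground-truth object is assigned to exactly one prediction, a matching of this kind lets the loss be computed on a set of predictions without duplicate assignments.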
movement tracking, human pose estimation in two-dimensional images and videos has recently become a hot topic in computer vision. In the realm of human pose estimation, many state-of-the-art approaches implemented with deep learning have overcome various obstacles and produced remarkable results. In recent years, deep learning approaches based on Convolutional Neural Networks have outperformed traditional methods in estimating human pose.
OmniPose, proposed by Bruno Artacho et al., is a single-pass, end-to-end trainable architecture for multi-person pose estimation that achieves state-of-the-art results. The OmniPose architecture utilizes multi-scale feature representations that boost the effectiveness of backbone feature extractors without the need for postprocessing.
OmniPose introduces an enhanced waterfall structure that increases the field of view while maintaining the high resolution of the feature maps. The enhanced structure is called Waterfall Atrous Spatial Pooling (WASPv2) and acts as both a feature extractor and a decoder. OmniPose also introduces Gaussian heatmap modulation to help localize the joint coordinates. This makes the positions of the points more accurate during upsampling and overcomes the quantization error that arises when resolution is increased. The improved waterfall module in OmniPose provides multi-scale representations that take advantage of the efficiency of progressive filtering in the cascade architecture while maintaining multi-scale fields of view comparable to spatial pyramid configurations.
The results show that the approach is stable and efficient for multi-person pose estimation and provides state-of-the-art results. Figure 5 compares the average precision of OmniPose with that of the original HRNet.
Figure 5. Comparison of average precision between OmniPose and the original HRNet
IV. ADVANCES IN MULTIMODAL LEARNING
While models like GPT-3 and EfficientNet, which work on text and images respectively, are responsible for some of deep learning's highest-profile successes, approaches that find relationships between text and images have also made impressive strides. These approaches belong to multimodal machine learning.
Multimodal momentum built upon decades of research. In 1989, researchers at Johns Hopkins University and UC San Diego developed a system that classified vowels based on audio and visual data of people speaking. Over the next two decades, various groups attempted multimodal applications such as indexing digital video libraries and classifying human emotions based on audiovisual data.
OpenAI proposed Contrastive Language-Image Pretraining (CLIP), a zero-shot image classifier which has obtained great results in multimodal machine learning. CLIP employs a text encoder (a modified Transformer) and an image encoder (a Vision Transformer) that have been trained on 400 million image-text pairs gathered from the web. It is trained to predict which of roughly 33,000 text fragments matches an image, using a contrastive loss function adapted from ConVIRT.
CLIP can do zero-shot classification in any image classification task, since it can predict which text best fits an image from any number of texts. During inference, CLIP is provided with a list of all possible classes in the form of “a photo of a {object}”. When supplied with an image, it then returns the most likely class from the list.
V. CONCLUSION
In this paper, important work on deep learning methods and recent advances in computer vision are summarized. As Transformer has become more and more widely used in NLP, researchers have been working hard to introduce it into the CV field. The development of CV Transformers has involved three stages. To begin with, the problem that the CNN structure can only extract local information and cannot take global information into account was addressed by incorporating the attention mechanism into CNNs. Following that, related research began to employ the whole Transformer model to replace the CNN and solve problems in the image domain. Now that Transformer has begun to solve CV problems, more work has been done to optimize the details of CV Transformers, including how to improve the running efficiency on high-resolution images, how to better convert images into sequences while maintaining their structural information, and how to balance running efficiency and effectiveness. Many new advances based on Transformers have been proposed. Transformer's application in the field of CV is maturing, and more research will be done in the future, which will greatly contribute to computer vision.
Another major area of progress is multimodal machine learning. Because images and text are so complex, researchers used to have their hands full focusing on one or the other, and they created very different techniques as a result. During the last decade, however, neural networks have brought computer vision and natural language processing closer together, allowing for unified models that combine the two modalities. It is expected that future multimodal methods will also incorporate audio to make further improvements. Multimodal machine learning might be one of the ultimate AI methods, since humans perceive the world through a variety of senses, such as sight, hearing, and touch.
REFERENCES
[1] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in nervous activity,” The bulletin of mathematical biophysics. Dallas, vol. 5, pp. 115–133, March 1943.
[2] D. O. Hebb, “The organization of behavior,” J. Appl. Behav. Anal. Los Angeles, vol. 25, pp. 575–577, April 1949.
[3] D. H. Ackley, G. E. Hinton and T.J. Sejnowski, “A learning algorithm
for Boltzmann machines,” Cognitive Sci. Chicago, vol. 9, pp. 147–169, May 1985.
[4] X. Wu, D. Sahoo, and S. Hoi, “Recent Advances in Deep Learning for Object Detection,” Neurocomputing. London, vol. 45, pp. 120–127, June 2020.
[5] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” Communications of the ACM. New York, vol. 60, pp. 84–90, July 2012.
[6] S. Jin, S. Kim, and H. S. Kim, “Recent advances in deep learning–based side–channel analysis,” ETRI Journal. London, vol. 20, pp. 107–113, March 2020.
[7] K. He, X. Zhang, and S. Ren, “Deep Residual Learning for Image Recognition,” CVPR. London, vol. 98, pp. 1103–1114, May 2016.
[8] M. Tan, and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” In Proceedings of the 36th international conference on machine learning. Houston, vol. 97, pp. 6105–6114, July 2019.
[9] A. Dosovitskiy, and L. Beyer, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” ArXiv, abs/2010.11929.
[10] Z. Liu, Y. Lin, and Y. Cao, “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” 2021 IEEE/CVF International Conference on Computer Vision. London, vol. 56, pp. 9992–10002, May 2021.
[11] M. Wortsman, G. Ilharco, and S. Y. Gadre, “Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time,” ArXiv. London, vol. 23, pp. 113–119, June 2020.
[12] N. Carion, F. Massa, and G. Synnaeve, “End-to-End Object Detection with Transformers,” ArXiv. London, vol. 89, pp. 12872, March 2020.
[13] H. Zhang, F. Li, and S. Liu, “DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection,” ArXiv. London, vol. 56, pp. 113–119, April 2020.
[14] S. Shao, Z. Li, and T. Zhang, “Objects365: A Large-Scale, High-Quality Dataset for Object Detection,” 2019 IEEE/CVF International Conference on Computer Vision. London, vol. 80, pp. 8429–8438, May 2019.
[15] Z. Liu, H. Hu, and Y. Lin, “Swin Transformer V2: Scaling Up Capacity and Resolution,” ArXiv. London, vol. 35, pp. 110–118, June 2020.