An Intelligent Knowledge Extraction Framework For Recognizing Identification Information From Real-World ID Card Images
ABSTRACT In this work, we study the problem of recognizing identification (ID) information from
unconstrained real-world images of ID cards, a task that arises widely in practical scenarios.
Manual processing of the task is impractical because of its unaffordable cost in labor and
time as well as the unreliable quality of manual labeling. In this paper, we propose an intelligent
framework for automatically recognizing ID information from images of ID cards. Specifically, we first
conduct marginal detection using a multi-operator algorithm and then localize the region of the ID card among
all the proposed candidate regions with an SVM classifier. Furthermore, we segment linguistic characters from
the card region with an improved projection algorithm. Finally, we recognize the specific characters with an
eight-layer convolutional neural network. We perform extensive experiments on a Chinese ID card dataset
to validate the effectiveness and efficiency of our proposed method. The experimental results demonstrate
the superiority of our proposal over other existing schemes.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
165448 VOLUME 7, 2019
L. Zuo et al.: Intelligent Knowledge Extraction Framework for Recognizing Identification Information
A. DENOISING
In order to remove noise and make the image smoother, we first apply Gaussian low-pass filtering and
image binarization, which help eliminate noise and interference. Filtering is the most basic operation
of image processing. In the broadest sense of the word “filtering”, the value of a filtered image at a
given location is a function of the values of the input image in a small neighborhood of the same
location. Gaussian low-pass filtering computes a weighted average of the pixel values in the
neighborhood [17], which removes much of the noise in the images. It is worth mentioning that even
though binarization is powerful for denoising, it has its own drawbacks. The most important one is the
loss of useful color properties once images are binarized.

The Gaussian filter kernel of size (2k + 1) × (2k + 1) is given by:

$$G_{xy} = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{(x-(k+1))^{2} + (y-(k+1))^{2}}{2\sigma^{2}}\right), \quad 1 \le x, y \le 2k+1$$

The input images are convolved with this Gaussian kernel so as to smooth them and eliminate noise and
interference.

B. MULTI-OPERATOR IN EDGE DETECTING
Using multiple operators to detect edges is the core of our proposed method. Without this stage we
cannot obtain the marginal information of the ID card for the ensuing segmentation and recognition.

Sobel operator: The Sobel operator is a well-known detection algorithm in the field of image processing
and machine vision [18], and it produces an image with emphasized edges. It uses two odd-sized
matrices, one for horizontal changes and the other for vertical changes, as kernels that are convolved
with the original image to calculate approximations of the derivatives. Compared to other edge
operators, the Sobel operator has two main merits: it adds some smoothing effect against the random
noise of the image, and, because it takes the difference between two rows or two columns, the elements
of the edge on both sides are enhanced [19].

$$G_x = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * A, \qquad G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * A$$

where $A$ is the source image, $G_x$ and $G_y$ are two images in which each point contains the horizontal
and vertical derivative approximations respectively, and $*$ denotes the 2-dimensional signal processing
convolution operation. We then calculate the gradient direction as follows:

$$\Theta = \operatorname{atan}\left(\frac{G_y}{G_x}\right)$$

where $\Theta$ is 0 for a vertical edge which is lighter on the right side, and $\Theta$ is $\pi$ for the
other case.

Canny operator: The Canny edge detector uses a multi-stage algorithm to detect a wide range of edges in
images and is optimal at any scale [20]. It has a simple approximate implementation in which edges are
marked at maxima in the gradient magnitude of a Gaussian-smoothed image. Empirical experience in the
area of machine vision suggests that Canny edge detection provides good and reliable detection.

We used multiple groups of parameters for the Sobel operator and the Canny edge detector together. The
multiple parameter groups and the combined results ensure that the algorithm can acquire the exact
marginal information. In fact, experiments show that the combination of the two operators outperforms
either single operator, as shown in Section V.

FIGURE 3. Intermediate result of the multi-operator algorithm which can acquire exact marginal
information. The purpose of adding mosaics is to protect the sensitive information of the ID card.

C. DISCRIMINATING THE CANDIDATES
There are several candidate images after executing the multi-operator algorithm, most of which are not
very useful. We therefore classify them. In this study, we use the Support Vector Machine (SVM) model
and the confidence algorithm to select the useful images before proceeding to the next step. The
SVM [21] is a class of machine learning models used for classification and regression. Given a set of
labeled training samples, each belonging to one of two categories, positive or negative, the SVM learns
from those samples to assign a new image to one of the two classes. The result of training an SVM is a
classification model that can be used to classify new examples.

The SVM exhibits excellent generalization performance (accuracy on test sets) in practice and has
strong theoretical motivation in statistical learning theory [22].

In our case, we have 1200 positive samples and 1200 negative samples. The positive samples are all
regular ID card images with complete information, whereas the negative
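To make the denoising and Sobel steps of Sections A and B concrete, the following is a minimal pure-Python sketch, not the authors' implementation. It builds the (2k + 1) × (2k + 1) Gaussian kernel from the formula in Section A and computes the Sobel responses and gradient direction from Section B. Several choices here are assumptions rather than details from the paper: the function names are hypothetical, the kernel is renormalised so its weights sum to 1, the kernels are applied as a sliding-window correlation without flipping (the convention of common image libraries), and `atan2` replaces `atan` so that the direction π is attainable.

```python
import math


def gaussian_kernel(k, sigma):
    """Build the (2k+1) x (2k+1) Gaussian kernel, indexing x and y from 1."""
    size = 2 * k + 1
    kernel = [
        [
            math.exp(-((x - (k + 1)) ** 2 + (y - (k + 1)) ** 2) / (2 * sigma ** 2))
            / (2 * math.pi * sigma ** 2)
            for y in range(1, size + 1)
        ]
        for x in range(1, size + 1)
    ]
    # Assumption: renormalise so the discrete weights sum to 1.
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def correlate2d(image, kernel):
    """Slide the kernel over the image without padding and without flipping."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(len(image) - kh + 1):
        row = []
        for j in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[u][v] * image[i + u][j + v]
                           for u in range(kh) for v in range(kw)))
        out.append(row)
    return out


SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal changes
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical changes


def sobel_direction(image):
    """Return G_x, G_y and the per-pixel gradient direction Theta."""
    gx = correlate2d(image, SOBEL_X)
    gy = correlate2d(image, SOBEL_Y)
    theta = [[math.atan2(b, a) for a, b in zip(row_x, row_y)]
             for row_x, row_y in zip(gx, gy)]
    return gx, gy, theta


# Hypothetical toy image: a vertical edge that is lighter on the right side.
step = [[0, 0, 0, 10, 10, 10] for _ in range(5)]
smoothed = correlate2d(step, gaussian_kernel(k=1, sigma=1.0))
gx, gy, theta = sobel_direction(step)
```

Under these assumptions the toy edge gives Θ = 0 at the edge pixels, and mirroring the image (lighter on the left) gives Θ = π, which matches the two cases named in the text.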
FIGURE 9. In order to enlarge our training dataset, we used “raw” words that did not exist in our
earlier training dataset to “create” some “artificial” ID cards, ignoring syntax and grammar.

FIGURE 10. The segmentation result of an “artificial” ID card. The segmentation result and the pieces
(samples) are the real data used when we train our model. The recognition accuracy increased
significantly after these artificial samples were added to the training set.

in the form of multiple arrays like an image composed of arrays containing pixel [31]. It is a
particular type of deep, feedforward network that is much easier to train and generalizes better than
networks with full connectivity between adjacent layers [32], [33]. The CNN has many practical
achievements in image processing and is widely used in the field of machine vision. In the early 1990s,
the CNN model was already used for face recognition. Now, CNNs are the dominant model for recognition
and detection tasks [9], [34], [35].

A CNN model has a series of layers. Most CNN models use one convolution layer followed by one pooling
layer to obtain features, plus a fully connected layer, as hidden layers, and then a classification
layer for output; the hidden layers can be stacked freely. To obtain better features, our CNN model
twice stacks a block of two convolutional layers and one max-pooling layer, which differs from the
conventional CNN structure. The CNN has fully connected layers for nonlinear classification. The inputs
of the model are the images of Chinese characters and the outputs are the corresponding recognition
results. The architecture of our model is shown in Fig. 11.

$$z_{i+1}^{k} = g_i\left(W_{ik} z_i + b_{ik}\right)$$

where $W_{ik}$ and $b_{ik}$ denote the k-th filter kernel and bias respectively, and $g(\cdot)$ is a
nonlinear activation function. In our study, the Rectified Linear Unit (ReLU) [36] is used, and we will
introduce it later.

Max-pooling Layer: After two convolution layers we apply a max-pooling layer,
$h_i = \max_{(u,v)} z_2^{i}(u, v)$, to capture the most important feature, where the index 2 indicates
that max-pooling uses the feature map of the second convolutional layer. Max-pooling helps maintain
translation invariance, which means that the same feature remains active even when the image undergoes
translations. This is very useful because our images come from real-world photos, which always have a
little left or right translation or tilt, yet we need the model to classify them accurately regardless
of position.

Fully-connected Layer: We use a fully-connected layer $h = [h_1, \ldots, h_k]^{T}$ for classification
purposes, where k is the number of filters.

In addition to the above layers, the neural network includes an input layer and an output layer. The
input layer takes the Chinese character images introduced before, while the output layer is a softmax
layer which we use to classify characters and then output the recognition results [37]. Inspired by
Natural Language Processing (NLP) tasks such as language modeling [38], [39] and machine
translation [40], [41], we use the vocabulary probability as our output. In this approach, the dimension
of the softmax layer is the same as the vocabulary size, and each output represents the probability of
the corresponding word. Although the number of parameters in the softmax layer is huge, this method is
a promising way to predict words in NLP tasks. We present and analyze the experimental results in
Section V.

V. EXPERIMENTS
In this section, we present and discuss our experimental results.
To handle the data problem, we first searched for available datasets on the Internet. However, this
did not work because the ID card has its own peculiar fonts and the available datasets are unsuitable.
Besides, there are very limited references on Chinese ID card identification. Some companies have
developed their own ID card identification SDKs for limited use and commercial applications. However,
as the principles and algorithms of ID card identification are neither publicized nor reported
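The layer definitions given earlier for the convolutional, max-pooling and softmax layers can be illustrated with a small pure-Python forward pass. This is a sketch under stated assumptions, not the authors' network: it uses a single channel and a single hypothetical 3 × 3 filter instead of 64 filters, applies no padding, and all function names, weights and input values are invented for illustration. It follows the block structure described in the text, two convolutional layers with an activation `g` followed by one 2 × 2 max-pooling layer, on a 30 × 30 input.

```python
import math


def relu(x):
    """Rectified Linear Unit: f(x) = max{0, x}."""
    return max(0.0, x)


def sigmoid(x):
    """Alternative activation, usable as g in conv_layer."""
    return 1.0 / (1.0 + math.exp(-x))


def conv_layer(z, w, b, g=relu):
    """Apply one filter: g(W z + b) over every window, without padding."""
    kh, kw = len(w), len(w[0])
    return [
        [g(sum(w[u][v] * z[i + u][j + v] for u in range(kh) for v in range(kw)) + b)
         for j in range(len(z[0]) - kw + 1)]
        for i in range(len(z) - kh + 1)
    ]


def max_pool(z, size=2):
    """Non-overlapping size x size max-pooling."""
    return [
        [max(z[i + u][j + v] for u in range(size) for v in range(size))
         for j in range(0, len(z[0]) - size + 1, size)]
        for i in range(0, len(z) - size + 1, size)
    ]


def softmax(scores):
    """Turn class scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 30 x 30 gray-scale input and 3 x 3 filter weights.
x = [[((3 * i + 5 * j) % 7) / 6.0 for j in range(30)] for i in range(30)]
w = [[0.1, -0.2, 0.1], [0.0, 0.3, 0.0], [-0.1, 0.2, -0.1]]

z1 = conv_layer(x, w, b=0.05)    # first convolutional layer, 28 x 28
z2 = conv_layer(z1, w, b=0.05)   # second convolutional layer, 26 x 26
h = max_pool(z2)                 # 2 x 2 max-pooling, 13 x 13
probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for three classes
```

The translation-invariance remark in the text can be seen on a tiny case: `max_pool([[0, 9], [0, 0]])` and `max_pool([[9, 0], [0, 0]])` both return `[[9]]`, so a feature shifted by one pixel inside a pooling window still produces the same pooled response.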
A. MULTI-OPERATOR EXPERIMENTS
Photos of ID cards are generally taken casually, so we must first locate the ID card in the picture. As
mentioned in Section II, we use the multi-operator algorithm to obtain the ID card profile and then
acquire the region of the ID card in an image. Initially, we used only the Sobel operator to detect the
ID card profile in an image, but the results were unstable. Because a wide variety of images is
collected, a single operator may not be able to locate the ID card in every image. The worst result is
shown in Fig. 12(a), where the location of the ID card was not identified correctly. The problem arises
because the shapes of ID cards in images are diverse, as noted before, so we combined the Sobel
operator and the Canny operator and used five groups of parameters to deal with complex situations. As
expected, the profiles of most ID cards can be extracted in this way; the result is shown in
Fig. 12(b).

FIGURE 12. Multi-operator experiments.

B. CHARACTER RECOGNITION EXPERIMENTS
In this section we present our recognition experiments with the convolutional neural network model.
Because our model will be used in practical applications, we invite humans to evaluate our model based
on human perception of the recognition quality.

Dataset and Evaluation: As mentioned before, we use two datasets to test our model. The smaller one
includes 1000 Chinese character image samples, while the larger one includes more than 3000 image
samples. We carried out a human evaluation and a reference label evaluation. The human evaluation is
the human perception of the model's recognition quality: we asked human raters to rate the recognition
results in a side-by-side comparison.

Training details: Our character recognition model uses a seven-layer CNN to output a probability
p(c|x) over our dataset alphabet C, which includes 122/300 words, 10 digits and a special capital
letter “X”, giving a total of 133/311 classes. The inputs {z1 = x} of the CNN are gray-scale clipped
character images of 30 × 30 pixels. The images are convolved with 64 filters of size 3 × 3, passed
through the activation function, and pooled with a window of size 2 × 2. The loss function is the
cross-entropy function, and all the parameters of the model are jointly optimized to minimize the loss
over a training set using Stochastic Gradient Descent (SGD) and back-propagation.

Because the recognition accuracy of neural networks is affected by the choice of activation function,
we compare our model with three activation functions:
• Sigmoid function: a differentiable real function bounded in [0,1], computed as
$\mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}$.
• Tanh (hyperbolic tangent function): a solution to the nonlinear boundary value problem, bounded in
[−1,1], computed as $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$.
• ReLU (Rectified Linear Unit) [42]: a simple function expressed as f(x) = max{0, x}. The ReLU function
has become a standard nonlinear activation function for CNNs because it is simple to compute and
significantly reduces the number of iterations.

We use 100 ID card images to test the accuracy of the three activation functions with the same
parameters as our model. Table 1 shows the performance of the three activation functions. Apparently,
the ReLU function improves the recognition quality in all cases. Therefore, we chose the ReLU function
as the model's activation function.

TABLE 1. The performances of three activation functions, the higher the score, the better the model.

The performance of a neural network also depends on the structure of its hidden layers and can be
greatly influenced by the number of hidden layers. A conventional convolutional neural network always
has a pooling layer after each convolution layer. We tested several combinations of layers in search of
high accuracy. All combinations were tested on the same 100 ID card images as in the activation
function test, and the results are shown in Table 2. Although the three-times-stacked CNN structure
achieves higher accuracy, taking the time consumption into account we chose the twice-stacked CNN
structure.

TABLE 2. The performances of different structure of hidden layers. The 1,2 models are basic models,
d denotes the basic model stack times, 3,4 models are twice stacked model based 1,2 model respectively,
5 model is three times stacked model 2.

According to the above experimental results, our final model uses the ReLU activation function and a
structure of two convolutional layers followed by one max-pooling layer. The results show that our
model is reliable and satisfies the accuracy requirements of many scenarios. Before our work, most
security companies filled in the identification information manually, resulting in intensive human
labor and a high percentage of errors. In order to remove the laborious hand-typing work and increase
the processing speed, we developed the proposed approach and introduced human evaluation to illustrate
the advantages of our algorithm in terms of reducing manual labor and improving the accuracy rate. We
used 120 real-world ID cards that were excluded from our training samples to evaluate the recognition
rate. The results, obtained by human side-by-side evaluation, are shown in Table 3. Since there are
some anti-counterfeiting marks and patterns on the back side, it is not an easy task to locate
character areas and recognize information there. Moreover, the boundary between the periods causes
strong interference, leading to recognition errors. Therefore, the recognition rate for the back side
is often lower than for the front side. It should be noted that all the training datasets are from
real-world ID cards and that, due to the complexity of Chinese characters, it is normal for the
character recognition rate to be a little lower than the number recognition rate.

TABLE 3. Evaluation of recognition quality by human side-by-side evaluation. The number denotes
recognition rate (%). The front of the ID card includes the name, gender, birthday, nation, address and
ID card number of the holder (1-6 in the table); the back has two lines of words giving the expiry date
and the issuance authority.

C. MODEL RESULT EXHIBITION
Our model includes four parts: image processing, ID card location, character segmentation and
recognition. The overall performance of the model depends on the performance of these four sections.
Therefore each section of the model
is equally important. The results of our model are shown in Fig. 13. The original picture is a natural
photo of the kind a person would casually take; our model extracts the ID card region from the complex
background and recognizes it. Although the result is not 100% correct (the first digit of the ID number
is wrong, as shown in Fig. 13), the method is still acceptable and can significantly decrease the human
workload.

FIGURE 13. The recognition result.

Moreover, it is worth mentioning that our Chinese character dataset contains only about 3000 labeled
character images in total, including artificial data and real data, but does not cover the common
Chinese words. Our future work aims at addressing these flaws in order to improve our model.

VI. CONCLUSION
In this paper, we proposed an effective machine learning approach for real-world Chinese ID card
recognition, including all the techniques that are critical to ensuring its accuracy. To the best of
our knowledge, we are the first to use deep learning methods to identify Chinese ID cards and the first
to publicly report such use, which will drive the academic development of OCR identification
techniques. On our real-world ID card dataset, the recognition accuracy of our method approaches 94%,
which can significantly decrease human work and makes the method highly recommendable for use in
industry. Moreover, because of the limitations of the real-world ID card dataset, we leverage an ID
card character dataset built by ourselves to train a convolutional neural

REFERENCES
[1] B. Shan, “Vehicle license plate recognition based on text-line construction and multilevel RBF neural network,” JCP, vol. 6, no. 2, pp. 246–253, 2011.
[2] B. Yang, X. Zhang, L. Chen, H. Yang, and Z. Gao, “Edge guided salient object detection,” Neurocomputing, vol. 221, pp. 60–71, Jan. 2017.
[3] Z. Wang, G. Xu, Z. Wang, and C. Zhu, “Saliency detection integrating both background and foreground information,” Neurocomputing, vol. 216, pp. 468–477, Dec. 2016.
[4] G. Cheng, J. Han, L. Guo, X. Qian, P. Zhou, X. Yao, and X. Hu, “Object detection in remote sensing imagery using a discriminatively trained mixture model,” ISPRS J. Photogramm. Remote Sens., vol. 85, no. 9, pp. 32–43, 2013.
[5] J. Han, D. Zhang, G. Cheng, L. Guo, and J. Ren, “Object detection in optical remote sensing images based on weakly supervised learning and high-level feature learning,” IEEE Trans. Geosci. Remote Sens., vol. 53, no. 6, pp. 3325–3337, Jun. 2015.
[6] H. Sun, X. Sun, H. Wang, Y. Li, and X. Li, “Automatic target detection in high-resolution remote sensing images using spatial sparse coding bag-of-words model,” IEEE Geosci. Remote Sens. Lett., vol. 9, no. 1, pp. 109–113, Jan. 2012.
[7] L. Zhang, L. Zhang, D. Tao, and X. Huang, “Sparse transfer manifold embedding for hyperspectral target detection,” IEEE Trans. Geosci. Remote Sens., vol. 52, no. 2, pp. 1030–1043, Feb. 2014.
[8] G. Cheng, P. Zhou, and J. Han, “Learning rotation-invariant convolutional neural networks for object detection in VHR optical remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 54, no. 12, pp. 7405–7415, Dec. 2016.
[9] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
[10] Y. Guo, Y. Liu, A. Oerlemans, S. Lao, S. Wu, and M. S. Lew, “Deep learning for visual understanding: A review,” Neurocomputing, vol. 187, pp. 27–48, Apr. 2016.
[11] W. Liu, Z. Wang, X. Liu, N. Zeng, Y. Liu, and F. E. Alsaadi, “A survey of deep neural network architectures and their applications,” Neurocomputing, vol. 234, pp. 11–26, Apr. 2017.
[12] X. S. Xin, Q. Song, L. Yang, and W. H. Jia, “Recognition of the smart card iconic numbers,” in Proc. Int. Conf. Electron., Inf. Comput. Eng., vol. 44, 2016, Art. no. 02087.
[13] C. Yao and G. Cheng, “Approximative Bayes optimality linear discriminant analysis for Chinese handwriting character recognition,” Neurocomputing, vol. 207, pp. 346–353, Sep. 2016.
[14] W. Wu, J. Liu, and L. Li, “Text recognition in mobile images using perspective correction and text segmentation,” Int. J. Signal Process., Image Process. Pattern Recognit., vol. 9, no. 10, pp. 171–178, 2016.
[15] J.-B. Wang, N. He, L.-L. Zhang, and K. Lu, “Single image dehazing with a physical model and dark channel prior,” Neurocomputing, vol. 149, pp. 718–728, Feb. 2015.
[16] Z.-J. Zhu, Y. Wang, and G.-Y. Jiang, “Unsupervised segmentation of natural images based on statistical modeling,” Neurocomputing, vol. 252, pp. 95–101, Apr. 2017.
[17] C. Tomasi and R. Manduchi, “Bilateral filtering for gray and color images,” in Proc. ICCV, Jan. 1998, vol. 98, no. 1, pp. 839–846.
[18] S. Gupta and S. G. Mazumdar, “Sobel edge detection algorithm,” Int. J. Comput. Sci. Manage. Res., vol. 2, no. 2, pp. 1578–1583, Feb. 2013.
[19] W. Gao, X. Zhang, L. Yang, and H. Liu, “An improved sobel edge detection,” in Proc. 3rd IEEE Int. Conf. Comput. Sci. Inf. Technol. (ICCSIT), vol. 5, Jul. 2010, pp. 67–71.
[20] J. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[21] C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[22] V. N. Vapnik, Statistical Learning Theory, vol. 3. New York, NY, USA: Wiley, 1998.
[23] D. H. Ballard, “Generalizing the Hough transform to detect arbitrary shapes,” Pattern Recognit., vol. 13, no. 2, pp. 111–122, 1981.
[24] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You only look once: Unified, real-time object detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2016, pp. 779–788.
[25] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proc. 26th Annu. Int. Conf. Mach. Learn. (ACM), 2009, pp. 609–616.
[26] H. Qu, S. X. Yang, A. R. Willms, and Z. Yi, “Real-time robot path planning based on a modified pulse-coupled neural network model,” IEEE Trans. Neural Netw., vol. 20, no. 11, pp. 1724–1739, Nov. 2009.
[27] H. Qu, Z. Yi, and H. Tang, “Improving local minima of columnar competitive model for TSPs,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 53, no. 6, pp. 1353–1362, Jun. 2006.
[28] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for image classification,” 2012, arXiv:1202.2745. [Online]. Available: https://arxiv.org/abs/1202.2745
[29] A. Mohamed, G. E. Dahl, and G. Hinton, “Acoustic modeling using deep belief networks,” IEEE Trans. Audio, Speech, Language Process., vol. 20, no. 1, pp. 14–22, Jan. 2012.
[30] G. E. Dahl, D. Yu, L. Deng, and A. Acero, “Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 1, pp. 30–42, Jan. 2012.
[31] H. L. H. Li, J. Chen, and Z. Chi, “CNN for saliency detection with low-level feature integration,” Neurocomputing, vol. 226, pp. 212–220, Feb. 2017.
[32] Y. Le Cun, B. Boser, J. S. Denker, R. E. Howard, W. Hubbard, L. D. Jackel, and D. Henderson, “Handwritten digit recognition with a back-propagation network,” in Proc. Adv. Neural Inf. Process. Syst., 1990, pp. 396–404.
[33] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[34] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2015, pp. 648–656.
[35] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, “OverFeat: Integrated recognition, localization and detection using convolutional networks,” 2014, arXiv:1312.6229. [Online]. Available: https://arxiv.org/abs/1312.6229
[36] V. Nair and G. E. Hinton, “Rectified linear units improve restricted Boltzmann machines,” in Proc. 27th Int. Conf. Mach. Learn. (ICML), 2010, pp. 807–814.
[37] R. Collobert and J. Weston, “A unified architecture for natural language processing: Deep neural networks with multitask learning,” in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 160–167.
[38] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. 34th Int. Conf. Mach. Learn., vol. 70, 2017, pp. 933–941.
[39] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” 2018, arXiv:1810.04805. [Online]. Available: https://arxiv.org/abs/1810.04805
[40] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst., 2017, pp. 5998–6008.
[41] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” 2014, arXiv:1409.0473. [Online]. Available: https://arxiv.org/abs/1409.0473
[42] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. 14th Int. Conf. Artif. Intell. Statist., 2011, pp. 315–323.

WENYU CHEN received the Ph.D. degree in computer science from the University of Electronic Science and
Technology of China, Chengdu, in 2009, where he is currently a Professor with the School of Computer
Science and Engineering. His research interests include computing theory, and machine intelligence and
pattern recognition.

HONG QU received the Ph.D. degree in computer science from the University of Electronic Science and
Technology of China, Chengdu, in 2006. From 2014 to 2015, he was a Senior Visiting Scholar with the
Humboldt University of Berlin. He is currently a Professor with the School of Computer Science and
Engineering, University of Electronic Science and Technology of China. His research interests include
artificial intelligence and neural networks.

LI HUANG is currently pursuing the Ph.D. degree with the School of Computer Science and Engineering,
University of Electronic Science and Technology of China. Her research interests include deep learning
and natural language processing.

ZHENG WANG received the bachelor's and Ph.D. degrees from Zhejiang University, China, in 2017 and 2011,
respectively. From 2014 to 2015, he was a Visiting Scholar with DLR, German Aerospace Center funded by
CSC. He is currently a Postdoctoral Research Fellow with the School of Computer Science and
Engineering, University of Electronic Science and Technology of China (UESTC). His current research
interests include cross-media analysis, computer vision, and machine learning.