Going Deeper With Contextual CNN For Hyperspectral Image Classification

1) The document proposes a novel deep convolutional neural network (CNN) architecture called contextual deep CNN for hyperspectral image classification. 2) The contextual deep CNN can jointly exploit local contextual interactions between neighboring pixels by using a multi-scale convolutional filter bank as an initial component, followed by a fully convolutional network (FCN). 3) Experimental results on three benchmark hyperspectral datasets show the proposed approach achieves enhanced classification performance over current state-of-the-art methods.

Going Deeper with Contextual CNN for Hyperspectral Image Classification
Hyungtae Lee, Member, IEEE, and Heesung Kwon, Senior Member, IEEE

arXiv:1604.03519v3 [cs.CV] 9 May 2017

Abstract—In this paper, we describe a novel deep convolutional neural network (CNN) that is deeper and wider than other existing deep networks for hyperspectral image classification. Unlike current state-of-the-art approaches in CNN-based hyperspectral image classification, the proposed network, called contextual deep CNN, can optimally explore local contextual interactions by jointly exploiting local spatio-spectral relationships of neighboring individual pixel vectors. The joint exploitation of the spatio-spectral information is achieved by a multi-scale convolutional filter bank used as an initial component of the proposed CNN pipeline. The initial spatial and spectral feature maps obtained from the multi-scale filter bank are then combined together to form a joint spatio-spectral feature map. The joint feature map representing rich spectral and spatial properties of the hyperspectral image is then fed through a fully convolutional network that eventually predicts the corresponding label of each pixel vector. The proposed approach is tested on three benchmark datasets: the Indian Pines dataset, the Salinas dataset and the University of Pavia dataset. Performance comparison shows enhanced classification performance of the proposed approach over the current state-of-the-art on the three datasets.

Index Terms—Convolutional neural network (CNN), hyperspectral image classification, residual learning, multi-scale filter bank, fully convolutional network (FCN)

[Figure 1 graphic: (a) Residual learning. (b) Multi-scale filter bank with 5x5, 3x3 and 1x1 filters. (c) Fully Convolutional Network (FCN), from input to output.]
Fig. 1. Key components of the proposed network.

Manuscript received October 05, 2016; revised April 13, 2017.
Hyungtae Lee is with Booz Allen Hamilton Inc., McLean, VA, 22102 USA (e-mail: lee [email protected]).
Heesung Kwon is with the Image processing branch, the Sensors & Electron Devices Directorate (SEDD), Army Research Laboratory, Adelphi, MD, 20783 USA (e-mail: [email protected]).

I. INTRODUCTION

Recently, deep convolutional neural networks (DCNN) have been extensively used for a wide range of visual perception tasks, such as object detection/classification, action/activity recognition, etc. Behind the remarkable success of DCNN on image/video analytics are its unique capabilities of extracting underlying nonlinear structures of image data as well as discerning the categories of semantic data contents by jointly optimizing parameters of multiple layers together.

Lately, there have been increasing efforts to use deep learning based approaches for hyperspectral image (HSI) classification [1]–[8]. However, in reality, large scale HSI datasets are not currently commonly available, which leads to sub-optimal learning of DCNN with large numbers of parameters due to the lack of enough training samples. The limited access to large scale hyperspectral data has been preventing existing CNN-based approaches for HSI classification [1]–[6] from leveraging deeper and wider networks that can potentially better exploit very rich spectral and spatial information contained in hyperspectral images. Therefore, current state-of-the-art CNN-based approaches mostly focus on using small-scale networks with relatively fewer numbers of layers and nodes in each layer at the expense of a decrease in performance. Deeper and wider mean using relatively larger numbers of layers (depth) and nodes in each layer (width), respectively. Accordingly, the reduction of the spectral dimension of the hyperspectral images is in general initially performed to fit the input data into the small-scale networks by using techniques, such as principal component analysis (PCA) [9], balanced local discriminant embedding (BLDE) [3], pairwise constraint discriminant analysis and nonnegative sparse divergence (PCDA-NSD) [10], etc. However, leveraging large-scale networks is still desirable to jointly exploit underlying nonlinear spectral and spatial structures of hyperspectral data residing in a high dimensional feature space. In the proposed work, we aim to build a deeper and wider network given limited amounts of hyperspectral data that can jointly exploit spectral and spatial information together. To tackle issues associated with training a large scale network on limited amounts of data, we leverage a recently introduced concept of “residual learning”, which has demonstrated the ability to significantly enhance the training efficiency of large scale networks. The residual learning [11] basically reformulates the learning of subgroups of layers called modules in such a way that each module is optimized by the residual signal, which is the difference between the desired output and the module input, as shown in Figure 1a. It is shown that the residual structure of the networks allows
for considerable increase in depth and width of the network leading to enhanced learning and eventually improved generalization performance. Therefore, the proposed network does not require pre-processing of dimensionality reduction of the input data as opposed to the current state-of-the-art techniques.

To achieve the state-of-the-art performance for HSI classification, it is essential that spectral and spatial features are jointly exploited. As can be seen in [1]–[3], [7], [8], the current state-of-the-art approaches for deep learning based HSI classification fall short of fully exploiting spectral and spatial information together. The two different types of information, spectral and spatial, are more or less acquired separately from pre-processing and then processed together for feature extraction and classification in [1], [7]. Hu et al. [2] also failed to jointly process the spectral and spatial information by only using individual spectral pixel vectors as input to the CNN. In this paper, inspired by [12], we propose a novel deep learning based approach that uses fully convolutional layers (FCN) [13] to better exploit spectral and spatial information from hyperspectral data. At the initial stage of the proposed deep CNN, a multi-scale convolutional filter bank conceptually similar to the “inception module” in [12] is simultaneously scanned through local regions of hyperspectral images generating initial spatial and spectral feature maps. The multi-scale filter bank is basically used to exploit various local spatial structures as well as local spectral correlations. The initial spatial and spectral feature maps generated by applying the filter bank are then combined together to form a joint spatio-spectral feature map, which contains rich spatio-spectral characteristics of hyperspectral pixel vectors. The joint feature map is in turn used as input to subsequent layers that finally predict the labels of the corresponding hyperspectral pixel vectors.

The proposed network¹ is an end-to-end network, which is optimized and tested all together without additional pre- and post-processing. The proposed network is a fully convolutional network (FCN) [13] (Figure 1c) to take input hyperspectral images of arbitrary size and does not use any subsampling (pooling) layers that would otherwise result in the output with different size than the input; this means that the network can process hyperspectral images with arbitrary sizes. In this work, we evaluate the proposed network on three benchmark datasets with different sizes (145×145 pixels for the Indian Pines dataset, 610×340 pixels for the University of Pavia dataset, and 512×217 pixels for the Salinas dataset). The proposed network is composed of three key components: a novel fully convolutional network, a multi-scale filter bank, and residual learning, as illustrated in Figure 1. Performance comparison shows enhanced classification performance of the proposed network over the current state-of-the-art on the three datasets.

¹ A preliminary version of this paper [14] was presented at the 2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2016).

The main contributions of this paper are as follows:
• We introduce the deeper and wider network with the help of “residual learning” to overcome sub-optimality in network performance caused primarily by limited amounts of training samples.
• We present a novel deep CNN architecture that can jointly optimize the spectral and spatial information of hyperspectral images.
• The proposed work is one of the first attempts to successfully use a very deep fully convolutional neural network for hyperspectral classification.

The remainder of this paper is organized as follows. In Section II, related works are described. Details of the proposed network are explained in Section III. Performance comparisons among the proposed network and current state-of-the-art approaches are described in Section IV. The paper is concluded in Section V.

II. RELATED WORKS

A. Going deeper with Deep CNN for object detection/classification

LeCun et al. introduced the first deep CNN called LeNet-5 [15] consisting of two convolutional layers, two fully connected layers, and one Gaussian connection layer with several additional layers for pooling. With the recent advent of large scale image databases and advanced computational technology, relatively deeper and wider networks, such as AlexNet [16], began to be constructed on large scale image datasets, such as ImageNet [17]. AlexNet used five convolutional layers with three subsequent fully connected layers. Simonyan and Zisserman [18] significantly increased the depth of Deep CNN, called VGG-16, with 16 convolutional layers. Szegedy et al. [12] introduced a 22 layer deep network called GoogLeNet, by using multi-scale processing, which is realized by using a concept of “inception module.” He et al. [11] built a network substantially deeper than those used previously by using a novel learning approach called “residual learning”, which can significantly improve training efficiency of deep networks.

B. Deep CNN for Hyperspectral Image Classification

A large number of approaches have been developed to tackle HSI classification problems [4], [19]–[42]. Recently, kernel methods, such as multiple kernel learning [19]–[25], have been widely used primarily because they can enable a classifier to learn a complex decision boundary with only a few parameters. This boundary is built by projecting the data onto a high-dimensional reproducing kernel Hilbert space [43]. This makes it suitable for exploiting datasets with limited training samples. However, recent advances in deep learning-based approaches have shown drastic performance improvements because of their capability to exploit complex local nonlinear structures of images using many layers of convolutional filters. To date, several deep learning-based approaches [1]–[6] have been developed for HSI classification. But few have achieved breakthrough performance due mainly to sub-optimal learning caused by the lack of enough training samples and the use of relatively small scale networks.

Deep learning approaches normally require large scale datasets whose size should be proportional to the number of parameters used by the network to avoid overfitting in
[Figure 2 graphic: AlexNet layer diagram. A 224×224×3 input passes through five convolutional layers (96, 256, 384, 384 and 256 filters) with ReLU, LRN and max pooling, then three dense layers (4096, 4096 and 1000 units) with ReLU and dropout, and a softmax output.]

Fig. 2. AlexNet [16]. The network consists of five convolutional layers and three fully connected layers. In the illustration, cubes and boxes indicate data
blobs. Several non-linear functions are also used in the network. Non-linear functions are listed beside the output blobs of each layer in order.
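
Of the non-linear functions listed beside the blobs in Fig. 2, local response normalization (LRN) is probably the least familiar. The following is a minimal NumPy sketch of Eq. (3), not the authors' implementation. The default hyper-parameters (k = 2, n = 5, α = 1e-4, β = 0.75) are the values reported for AlexNet [16] and are an assumption here, since this paper does not state the values it uses. Clipping the summation window at the first and last filter is likewise an assumption; Eq. (3) does not spell out the boundary handling.

```python
import numpy as np

def local_response_norm(a, k=2.0, n=5, alpha=1e-4, beta=0.75):
    """Sketch of Eq. (3): normalize each activation by its n adjacent filters.

    a: array of shape (H, W, N) holding the activations of N filters.
    The hyper-parameter defaults are assumed (AlexNet values), not taken
    from this paper.
    """
    num_filters = a.shape[-1]
    squared = a.astype(float) ** 2
    out = np.empty_like(a, dtype=float)
    half = n // 2
    for i in range(num_filters):
        # Window j = i - n/2 ... i + n/2, clipped to valid filter indices
        # (boundary handling is an assumption).
        lo = max(0, i - half)
        hi = min(num_filters - 1, i + half)
        window_sum = squared[..., lo:hi + 1].sum(axis=-1)
        out[..., i] = a[..., i] / (k + alpha * window_sum) ** beta
    return out
```

The normalization acts across the filter axis only, so the spatial size (H, W) of the blob is unchanged.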

learning the network. Chen et al. [1] used stacked autoencoders (SAE) to learn deep features of hyperspectral signatures in an unsupervised fashion followed by logistic regression used to classify extracted deep features into their appropriate material categories. Both a representative spectral pixel vector and the corresponding spatial vector obtained from applying principal component analysis (PCA) to hyperspectral data over the spectral dimension are acquired separately from a local region and then jointly used as an input to the SAE. In [7], Chen et al. replaced SAE by a deep belief network (DBN), which is similar to the deep convolutional neural network for HSI classification. Li et al. [8] also used a two-layer DBN but did not use initial dimensionality reduction, which would inevitably cause the loss of critical information of hyperspectral images. Hu et al. [2] fed individual spectral pixel vectors independently through a simple CNN, in which local convolutional filters are applied to the spectral vectors extracting local spectral features. Convolutional feature maps generated after max pooling are then used as the input to the fully connected classification stage for material classification. Chen et al. [4] also used a deep convolutional neural network adopting five convolutional layers and one fully connected layer for hyperspectral classification.

Unlike these deep learning-based approaches, we first attempt to build a much deeper and wider network using relatively small amounts of training samples. Once the network is effectively optimized, it is expected to provide enhanced performance over relatively shallow and narrow networks.

III. THE CONTEXTUAL DEEP CONVOLUTIONAL NEURAL NETWORK

In this section, we first describe the widely used CNN model referred to as AlexNet and then discuss the overall architecture of the proposed network. We elaborate on the two key components of the proposed network, “multi-scale convolutional filter bank” and “residual learning.” The learning process of the network is discussed at the end of the section.

A. Deep Convolutional Neural Network

A widely used deep CNN model includes multiple layers of neurons, each of which extracts a different level of non-linear features from the input ranging from low to high level features. Non-linearity in each layer is achieved by applying a nonlinear activation function to the output of local convolutional filters in each layer. The proposed network is basically a convolutional neural network with a nonlinear activation function used in [16].

In this section, we first describe the architecture of AlexNet, a widely used deep CNN model, as shown in Figure 2, to provide the basis for understanding the architecture of the proposed network. AlexNet consists of five convolutional layers and three fully connected layers. Each fully connected layer contains linear weights $W_{FC}$ connecting the relationship between input $x$ and output $y$:

$$ y = W_{FC} \cdot x, \tag{1} $$

where $x$ and $y$ represent the input and output vectors. A convolutional layer with $N$ local filters, $W_{C,i}$, $i = 1, 2, \ldots, N$, extracts local nonlinear features from the input and is expressed as:

$$ y = \{ W_{C,i} * x \}_{i=1,2,\ldots,N}, \tag{2} $$

where $*$ denotes a convolution. The filter size of all $\{W_{C,i}\}_{i=1,2,\ldots,N}$ is carefully determined to be much smaller than the size of $W_{FC}$.

In [16], several non-linear components, such as the local response normalization (LRN), max pooling, the rectified linear unit (ReLU), dropout, and softmax are used. LRN normalizes each activation $a_i$ over local activations of $n$ adjacent filters centered on the position $(p_x, p_y)$, which aims to generalize filter responses,

$$ a^{*}_{i}(p_x, p_y) = a_i(p_x, p_y) \Big/ \left( k + \alpha \sum_{j=i-n/2}^{i+n/2} \big( a_j(p_x, p_y) \big)^{2} \right)^{\beta}, \tag{3} $$

where $k$, $n$, $\alpha$, and $\beta$ are hyper-parameters. Max pooling down-samples the output of layers by replacing a sub-region of the output with the maximum value, which is commonly used for dimensionality reduction in CNN. ReLU rectifies negative
[Figure 3 graphic: an H×W×B input image feeds three parallel branches (5×5 convolution; 3×3 convolution followed by max pooling; 1×1 convolution followed by max pooling). Depth concatenation of the branches gives a 384-channel blob, followed by 1×1 convolutions with 128 filters each (with ReLU, LRN and dropout as marked), two summation (residual) connections, a final 1×1 convolution with one filter per label, and an argmax output.]

Fig. 3. An illustration of the architecture of the proposed network. The first row illustrates input and output blobs of convolutional layers and their
connections. The number of filters of each convolutional layer is indicated under its output blob. The second row shows a flow chart of the network.
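
The multi-scale filter bank at the left of Fig. 3 can be sketched in a few lines of NumPy to make the size bookkeeping of Section III-C concrete: the input is zero-padded by two pixels, convolved without further padding by 1×1, 3×3 and 5×5 filters, and the 1×1 and 3×3 outputs are brought back to (H, W) by stride-1 5×5 and 3×3 max pooling before depth concatenation. This is an illustrative sketch under assumptions, not the authors' code: the function names are hypothetical, biases and the ReLU that follows the filter bank are omitted, and the demo uses 4 filters per branch where the figure suggests 128 per branch (384 channels after concatenation).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv_valid(x, w):
    """Unpadded, stride-1 convolution (cross-correlation, as in CNN libraries).

    x: (H, W, B) input; w: (k, k, B, N) filters -> (H-k+1, W-k+1, N).
    """
    k = w.shape[0]
    windows = sliding_window_view(x, (k, k), axis=(0, 1))  # (H', W', B, k, k)
    return np.einsum("hwbij,ijbn->hwn", windows, w)

def max_pool_valid(f, k):
    """Unpadded, stride-1 k x k max pooling over the two spatial axes."""
    windows = sliding_window_view(f, (k, k), axis=(0, 1))  # (H', W', N, k, k)
    return windows.max(axis=(-2, -1))

def multi_scale_filter_bank(image, w1, w3, w5):
    """Joint spatio-spectral feature map of size (H, W, N1 + N3 + N5)."""
    padded = np.pad(image, ((2, 2), (2, 2), (0, 0)))   # two-pixel zero border
    f1 = max_pool_valid(conv_valid(padded, w1), 5)     # (H+4, W+4) -> (H, W)
    f3 = max_pool_valid(conv_valid(padded, w3), 3)     # (H+2, W+2) -> (H, W)
    f5 = conv_valid(padded, w5)                        # already (H, W)
    return np.concatenate([f5, f3, f1], axis=-1)       # depth concatenation

# Hypothetical sizes for a quick shape check.
rng = np.random.default_rng(0)
B, n_filters = 10, 4
image = rng.standard_normal((7, 7, B))
joint = multi_scale_filter_bank(
    image,
    rng.standard_normal((1, 1, B, n_filters)),
    rng.standard_normal((3, 3, B, n_filters)),
    rng.standard_normal((5, 5, B, n_filters)),
)
print(joint.shape)  # (7, 7, 12)
```

Because every branch returns an (H, W) map, the same function applies unchanged to a 5×5 training patch or to a whole image, which is the property the fully convolutional design relies on.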

values to zero and is used for the network to learn parameters with positive activations only. ReLU basically replaces the sigmoid function commonly used for other neural networks mainly because learning deep CNN with ReLU is several times faster than the network with other nonlinear activation functions such as tanh. Dropout is a function that forces the output of individual nodes of each layer to be zero with a probability under a certain threshold, which takes any value within (0, 1). In this work, we used a threshold of 0.5. Dropout reduces overfitting by preventing multiple adaptations of training data simultaneously (referred to as “complex co-adaptations”). Softmax is a generalization of the logistic function, which is defined as the gradient-log-normalizer of the categorical probability distribution:

$$ P(y = j \mid x, \{f_k\}_{k=1,2,\cdots,K}) = \frac{e^{f_j(x)}}{\sum_{k=1}^{K} e^{f_k(x)}}, \tag{4} $$

where $f_j$ is a classification function for the $j$-th class, whose input and output are $x$ and $y$, respectively. Therefore, softmax is useful for probabilistic multiclass classification including HSI classification.

B. Architecture of the Proposed Network

We propose a novel fully convolutional network (FCN) [13] with a number of convolutional layers for HSI classification, as shown in Figure 3. The first part of the network is a “multi-scale filter bank” followed by two blocks of convolutional layers associated with residual learning. The last three convolutional layers function in a similar manner to the fully connected layers for classification of the AlexNet, which performs classification using local features. Similar to AlexNet, the 7th and 8th convolutional layers have dropout in training. The ReLU is used after the multi-scale filter bank, the 2nd, 3rd, 5th, 7th, 8th convolutional layers, and two residual learning modules. The output of the first two convolutional layers is normalized by LRN. Note that the height and width of all data blobs in the architecture are the same and only their depth changes. No dimensionality reduction is performed throughout the FCN processing.

Note that convolving a 1 × 1 × d blob with l filters whose size is 1 × 1 × d can achieve the same effect as fully connecting the 1 × 1 × d input blob to l output nodes, as illustrated in Figure 4. Due to this “convolutionalized model”, FCN can be used for pixel classification, such as semantic segmentation, HSI classification, etc. Since our network is based on FCN, the proposed network learns on 5 × 5 pixels centered on individual pixel vectors and is applied to the whole image in test.

[Figure 4 graphic: a convolutional layer with l convolutional filters of dimension 1×1×d and a fully connected layer with d×l weights, each mapping a 1×1×d input to a 1×1×l output.]
Fig. 4. Convolutionalized model. For pixel classification, a convolutional layer can achieve the same effect as the fully connected layer with the same number of weights. In the above illustration, the convolutional layer uses l convolutional filters whose dimension is 1 × 1 × d and the weights of the fully connected layer are $\{w_{i,j}\}_{i=1,\cdots,d,\; j=1,\cdots,l}$. Both the convolutional layer and the fully connected layer use d × l weights.

How Much Deeper Does the Proposed Network Go? The proposed network contains a total of 9 layers, which is much deeper than other CNNs for HSI classification trained on the
same datasets [2]. However, the depth of 9 still does not seem to be large enough, especially when compared to the current state-of-the-art CNNs for image classification, such as ResNet [11]. This is mainly because HSI-based CNNs have to be trained on much smaller amounts of training samples than that of the image classification CNNs primarily trained on large scale databases, such as ImageNet (1.2 M) [17]. Constrained by highly limited HSI training data, the proposed going deeper strategy opts not to use a very large number of layers to avoid overfitting. However, it still uses a much greater number of layers than that of any other HSI-based CNNs. Table I shows a comparison of various CNNs for both image and HSI classification with regards to network variables, such as the number of layers and parameters, training data size, and a ratio between the number of the parameters and data size. Similar to data augmentation used in image classification CNNs, the proposed network also uses a data augmentation strategy described in Section III-E. As shown in Table I, the proposed network provides much larger ratios between the number of parameters and training data size than those of the baseline [2] for the same training dataset. Also, the parameter vs. data ratios of the proposed networks are at least approximately eight times larger than that of any image classification CNNs. This indicates that the architecture of the proposed network is designed to ensure that it provides sufficient depth of layers to fully exploit training data.

TABLE I
COMPARISON OF NETWORK VARIABLES OF VARIOUS CNNS FOR BOTH IMAGE AND HSI CLASSIFICATION.

Method                       # of Layer   param      data size   param/data
AlexNet [16]                 8            59.3M      12M         4.94
VGG16 [18]                   16           135.1M     12M         11.26
GoogLeNet [12]               22           6.8M       12M         0.57
ResNet152 [11]               152          56.0M      12M         4.66
[2]-Indian Pines             3            79.5K      1.6K        49.69
[2]-Salinas                  3            80.3K      3.1K        25.90
[2]-U. of Pavia              3            59.8K      1.8K        33.22
The Proposed-Indian Pines    9            1122.5K    6.4K        175.39
The Proposed-Salinas         9            1875.8K    12.4K       151.27
The Proposed-U. of Pavia     9            610.6K     7.2K        84.81

C. Multi-scale Filter Bank

The first convolutional layer applied to the input hyperspectral image uses a multi-scale filter bank that locally convolves the input image with three convolutional filters with different sizes (1 × 1 × B, 3 × 3 × B, and 5 × 5 × B where B is the number of spectral bands). The 3 × 3 × B and 5 × 5 × B filters are used to exploit local spatial correlations of the input image while the 1 × 1 × B filters are used to address spectral correlations. The output of the first convolutional layer, the three convolutional feature maps, as shown in Figure 3, are combined together to form a joint spatio-spectral feature map used as input to the subsequent convolutional layers. However, since the sizes of the feature maps from the three convolutional filters differ from each other, a strategy to adjust the feature maps to the same size is needed before they can be combined into a joint feature map. First, a space of two-pixel width filled with zeros is padded around the input image such that the size of the feature maps from the 1×1, 3×3, and 5×5 filters becomes (H + 4, W + 4), (H + 2, W + 2), and (H, W), respectively. H and W are the height and width of the input image, respectively. The size of all the feature maps becomes (H, W) after 5 × 5 and 3 × 3 max poolings are applied to the feature maps from the 1 × 1 and 3 × 3 filters, respectively.

3 × 3 and 5 × 5 convolutions with a large number of spectral bands can be expensive and merging of the output of the convolutional filter bank causes the size of the network to increase, which also inevitably leads to high computational complexity. As the network size is increased, optimizing the network with a small number of training samples will face overfitting and divergence. Therefore, a strategy to address the above issues needs to be used. To tackle the issues, we use residual learning modules and training data augmentation, described in Sections III-D and III-E, respectively.

Functionality of the Multi-scale Filter Bank. The multi-scale filter bank conceptually similar to the inception module in [12] is used to optimally exploit diverse local structures of the input image. [12] demonstrates the effectiveness of the inception module that enables the network to get deeper as well as to exploit local structures of the input image achieving state-of-the-art performance in image classification. The multi-scale filter bank in the proposed network is used in a somewhat different manner that aims to jointly exploit local spatial structures in conjunction with local spectral correlations at the initial stage of the proposed structure.

D. Residual Learning

The subsequent convolutional layers use 1 × 1 × B filters to extract nonlinear features from the joint spatio-spectral feature map. We use two modules of “residual learning” [11], which is shown to help significantly improve training efficiency of deep networks. The residual learning is to learn layers with reference to the layer input using the following formula:

$$ y = \mathcal{F}(x, \{W_i\}) + x, \tag{5} $$

where $x$ and $y$ are the input and output vectors of the layers considered, respectively. The function $\mathcal{F} := y - x$ is the residual mapping of the input to the residual output $y - x$ using convolutional filters $W_i$. He et al. [11] showed that it is easier to optimize $W_i$ with the residual mapping than to optimize those weights with the unreferenced mapping. In the proposed network, two convolutional layers are used for the residual mapping, and the input is added to their output through what are called “shortcut connections”. The residual learning is very effective in practice, which is also shown in [11]. ReLU is the function that makes the first layer in the module nonlinear. Note that both the multi-scale filter bank and the residual learning are effective in increasing the depth and width of the network while keeping the computational
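[Figure 5 graphic: a 5×5 neighborhood is cropped around each training pixel of the hyperspectral image, augmented with its mirrored copies (5×5v, 5×5h, 5×5d), and used together with the ground-truth label (e.g., woods) to learn the contextual deep CNN.]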

Fig. 5. The learning process of the proposed network. In the hyperspectral image, the 1×1 training pixel and its neighboring 5×5 pixels are indicated by a red and a white rectangle, respectively. In the red box representing augmented training data, 5×5v, 5×5h, and 5×5d are the training samples mirrored across the horizontal, vertical, and diagonal axes, respectively.
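
The mirroring shown in Fig. 5 can be sketched as follows. This is a minimal NumPy illustration, not the authors' code: `augment_patch` is a hypothetical helper, and the diagonal mirror is assumed to be a reflection across the main diagonal (a spatial transpose), since the text does not say which diagonal is meant.

```python
import numpy as np

def augment_patch(patch):
    """Return the original patch and its three mirrored copies.

    patch: (S, S, B) array, e.g. a 5 x 5 neighborhood with B spectral bands.
    Output: (4, S, S, B) array, i.e. four training samples per labeled pixel.
    """
    if patch.ndim != 3 or patch.shape[0] != patch.shape[1]:
        raise ValueError("expected a square patch of shape (S, S, B)")
    across_horizontal = patch[::-1, :, :]      # rows reversed (5x5v in Fig. 5)
    across_vertical = patch[:, ::-1, :]        # columns reversed (5x5h)
    across_diagonal = patch.transpose(1, 0, 2) # main diagonal assumed (5x5d)
    return np.stack([patch, across_horizontal, across_vertical, across_diagonal])
```

All three mirrors leave the center pixel of an odd-sized patch in place, so the label of the training pixel still belongs to the center of every augmented sample.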

budget constrained [11], [12]. This helps to effectively learn the deep network with a small number of training samples.
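
A residual module of the kind described in this subsection can be sketched in NumPy as below. The sketch assumes what the text states: two 1×1 convolutional layers form the residual mapping F, with ReLU after the first, and the shortcut adds the module input as in Eq. (5). A 1×1 convolution over an (H, W, C) blob is written as a per-pixel matrix product. The function names and the channel count in the demo are hypothetical; the ReLU that the paper applies after each module is left to the caller.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def conv1x1(x, w, b):
    """1x1 convolution: x (H, W, C_in), w (C_in, C_out), b (C_out,)."""
    return x @ w + b  # the same weights are applied at every pixel

def residual_module(x, w1, b1, w2, b2):
    """Eq. (5): y = F(x, {W_i}) + x with a two-layer residual mapping F."""
    f = conv1x1(relu(conv1x1(x, w1, b1)), w2, b2)
    return f + x  # shortcut connection; needs C_out of w2 == C_in of x

# Hypothetical demo: 8 channels instead of the 128 suggested by Fig. 3.
rng = np.random.default_rng(0)
x = rng.standard_normal((5, 5, 8))
w1 = rng.standard_normal((8, 8))
w2 = rng.standard_normal((8, 8))
y = residual_module(x, w1, np.ones(8), w2, np.zeros(8))
print(y.shape)  # (5, 5, 8)
```

If the second layer's weights and bias are all zero, the module reduces to the identity, which is one way to see why the layers only have to learn the difference between the desired output and the input.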

E. Learning the Proposed Network

We randomly sample a certain number of pixels from the hyperspectral image for training and use the rest to evaluate the performance of the proposed network. For each training pixel, we crop surrounding 5×5 neighboring pixels for learning convolutional layers. The proposed network contains approximately 1000K parameters, which are learned from several hundreds of training pixels from each material category. To avoid overfitting, we augment the number of training samples four times by mirroring the training samples across the horizontal, vertical, and diagonal axes. Figure 5 illustrates the learning process of the proposed network.

For learning the proposed network, stochastic gradient descent (SGD) with a batch size of 10 samples is used with 100K iterations, a momentum of 0.9, a weight decay of 0.0005 and a gamma of 0.1. We initially set a base learning rate as 0.001. The base learning rate is decreased to 0.0001 after 33,333 iterations and is further reduced to 0.00001 after 66,666 iterations. To learn the network, the last argmax layer is replaced by a softmax layer commonly used for learning convolutional layers. The first, second, and ninth convolutional layers are initialized from a zero-mean Gaussian distribution with standard deviation of 0.01 and the remaining convolutional layers are initialized with standard deviation of 0.005. Biases of all convolutional layers except the last layer are initialized to one and the last layer is initialized to zero.

[Figure 6 graphic: panels for Indian Pines, Salinas, and University of Pavia.]
Fig. 6. Three HSI datasets. Indian Pines, Salinas, and University of Pavia datasets. For each dataset, a three-band color composite image is given on the left and the ground truth is shown on the right. In the ground truth, pixels belonging to the same class are depicted with the same color.

TABLE II
SELECTED CLASSES FOR EVALUATION AND THE NUMBERS OF TRAINING AND TEST SAMPLES USED FROM THE INDIAN PINES DATASET

No   Class             Training   Test
1    Corn-notill       200        1228
2    Corn-mintill      200        630
3    Grass-pasture     200        283
4    Hay-windrowed     200        278
5    Soybean-notill    200        772
6    Soybean-mintill   200        2255
7    Soybean-clean     200        393
8    Woods             200        1065
     Total             1600       6904

IV. EXPERIMENTAL RESULTS

A. Dataset and Baselines

The performance of HSI classification of the proposed network is evaluated on three datasets: the Indian Pines dataset, the Salinas dataset, and the University of Pavia dataset, as shown in Figure 6. The Indian Pines dataset consists of
[Figure 7 graphic: ground-truth and classification maps with class-color legends for (a) Indian Pines, (b) Salinas, and (c) University of Pavia; the class names are those listed in Tables II, III, and IV.]

Fig. 7. RGB composition maps of groundtruth (left) of each dataset and the classification results (center) from the proposed network for the dataset.

145×145 pixels and 220 spectral reflectance bands covering the range from 0.4 to 2.5 µm with a spatial resolution of 20 m. The Indian Pines dataset originally has 16 classes but we only use 8 classes with relatively large numbers of samples. The Salinas dataset consists of 512×217 pixels and 224 spectral bands. It contains 16 classes and is characterized by a high spatial resolution of 3.7 m. The University of Pavia dataset contains 610×340 pixels with 103 spectral bands covering the spectral range from 0.43 to 0.86 µm with a spatial resolution of 1.3 m. 9 classes are in the dataset. For the Salinas dataset and the University of Pavia dataset, we use all classes because both datasets do not contain classes with a relatively small number of samples.

TABLE III
SELECTED CLASSES FOR EVALUATION AND THE NUMBERS OF TRAINING AND TEST SAMPLES USED FROM THE SALINAS DATASET

No   Class                       Training   Test
1    Broccoli green weeds 1      200        1809
2    Broccoli green weeds 2      200        3526
3    Fallow                      200        1776
4    Fallow rough plow           200        1194
5    Fallow smooth               200        2478
6    Stubble                     200        3759
7    Celery                      200        3379
8    Grapes untrained            200        11071
9    Soil vineyard develop       200        6003
10   Corn senesced green weeds   200        3078
11   Lettuce romaines, 4 wk      200        868
12   Lettuce romaines, 5 wk      200        1727
13   Lettuce romaines, 6 wk      200        716
14   Lettuce romaines, 7 wk      200        870
15   Vineyard untrained          200        7068
16   Vineyard vertical trellis   200        1607
     Total                       3200       50929

TABLE IV
SELECTED CLASSES FOR EVALUATION AND THE NUMBERS OF TRAINING AND TEST SAMPLES USED FROM THE UNIVERSITY OF PAVIA DATASET

No   Class        Training   Test
1    Asphalt      200        6431
2    Meadows      200        18449
3    Gravel       200        1899
4    Trees        200        2864
5    Sheets       200        1145
6    Bare soils   200        4829
7    Bitumen      200        1130
8    Bricks       200        3482
9    Shadows      200        747
     Total        1800       40976

datasets, an approach using diversified Deep Belief Networks (D-DBN) [6] provides higher HSI classification accuracy than that of the network in [2]. We also use D-DBN as a baseline in this work. For the Indian Pines dataset, we also use three types of neural networks evaluated in [2]: a two layer fully connected neural network (Two-layer NN), a fully connected neural network with one hidden layer (Three-layer NN), and the classic LeNet-5 [15].

For a fair comparison, we randomly select 200 samples from each class and use them as training samples as in [2]. The rest are used for testing the proposed network. The selected classes and the numbers of training and test samples of the three datasets are listed in Tables II, III, and IV. In the literature on HSI classification, different train/test dataset partitions are used to evaluate the proposed approaches. Among them, our dataset partition using 200 training samples has two advantages in evaluating the proposed network: i) evaluation with this partition can verify our contribution, which is building a deeper and wider network with a relatively small number of training
We compare the performance of the proposed network to samples and ii) [2] using this partition can provide reasonable
the one reported in [2] that used a different deep CNN archi- performance of relatively good baselines, such as RBF-SVM
tecture and RBF kernel-based SVM on the three hyperspectral and the shallower CNN. For all experiments, we perform the
datasets. The deep CNN used in [2] consists of two convolu- random train/test partition 20 times and report mean and stand
tional layers and two fully connected layers, which is much deviation of overall classification accuracy (OA). We have
shallower than our proposed network with nine convolutional carried out all the experiments on Caffe framework [44] with
layers. Currently, for the Indian Pines and University of Pavia a Titan X GPU.

TABLE V
COMPARISON OF HYPERSPECTRAL CLASSIFICATION PERFORMANCE AMONG THE PROPOSED NETWORK AND THE BASELINES ON THREE DATASETS (IN
PERCENTAGE). THE BEST PERFORMANCE AMONG 20 TRAIN/TEST PARTITIONS IS SHOWN IN PARENTHESES. THE BEST PERFORMANCE AMONG ALL
METHODS IS INDICATED IN BOLD FONT.

Method                   Indian Pines           Salinas                University of Pavia
Two-layer NN [2]         86.49                  ·                      ·
RBF-SVM [2]              87.60                  91.66                  90.52
Three-layer NN [1], [2]  87.93                  ·                      ·
LeNet-5 [2], [15]        88.27                  ·                      ·
Shallower CNN [2]        90.16                  92.60                  92.56
D-DBN [6]                91.03 ± 0.12           ·                      93.11 ± 0.06
The proposed network     93.61 ± 0.56 (94.24)   95.07 ± 0.23 (95.42)   95.97 ± 0.46 (96.73)
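The gap between the proposed network and the strongest reported baseline on each dataset follows directly from the mean accuracies in Table V. The short sketch below only redoes that arithmetic; the dictionary layout and the function name are illustrative assumptions, and `None` marks entries the table does not report.

```python
# Mean overall accuracy (%) copied from Table V; None marks entries not reported.
BASELINES = {
    "Two-layer NN":   {"Indian Pines": 86.49, "Salinas": None,  "University of Pavia": None},
    "RBF-SVM":        {"Indian Pines": 87.60, "Salinas": 91.66, "University of Pavia": 90.52},
    "Three-layer NN": {"Indian Pines": 87.93, "Salinas": None,  "University of Pavia": None},
    "LeNet-5":        {"Indian Pines": 88.27, "Salinas": None,  "University of Pavia": None},
    "Shallower CNN":  {"Indian Pines": 90.16, "Salinas": 92.60, "University of Pavia": 92.56},
    "D-DBN":          {"Indian Pines": 91.03, "Salinas": None,  "University of Pavia": 93.11},
}
PROPOSED = {"Indian Pines": 93.61, "Salinas": 95.07, "University of Pavia": 95.97}

def margin_over_best_baseline(dataset):
    """Return (best baseline name, proposed mean OA minus that baseline's OA)."""
    reported = {m: acc[dataset] for m, acc in BASELINES.items() if acc[dataset] is not None}
    best_method = max(reported, key=reported.get)
    return best_method, round(PROPOSED[dataset] - reported[best_method], 2)

for name in PROPOSED:
    print(name, margin_over_best_baseline(name))
```

With the values in Table V this gives margins of 2.58, 2.47, and 2.86 percentage points over D-DBN, the shallower CNN, and D-DBN, respectively.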

TABLE VI
PERFORMANCE COMPARISON OF THE PROPOSED NETWORK IN PERCENTAGE W.R.T. VARYING WIDTHS (NUMBER OF KERNELS IN EACH LAYER).

Dataset              64             128           192           256
Indian Pines         80.38 ± 14.20  93.61 ± 0.56  93.47 ± 0.41  92.79 ± 0.81
Salinas              91.35 ± 3.62   93.60 ± 0.58  95.07 ± 0.23  94.10 ± 0.55
University of Pavia  94.77 ± 0.83   95.97 ± 0.46  95.86 ± 0.50  95.78 ± 0.52

TABLE VII
TRAINING TIME (IN SECONDS) OF THE PROPOSED NETWORK W.R.T. VARYING WIDTHS (NUMBER OF KERNELS IN EACH LAYER).

Dataset              64   128  192  256
Indian Pines         351  482  576  738
Salinas              428  598  696  896
University of Pavia  349  474  597  751

TABLE VIII
PERFORMANCE COMPARISON OF THE PROPOSED NETWORK IN PERCENTAGE W.R.T. VARYING DEPTHS (NUMBER OF RESIDUAL LEARNING MODULES).

Dataset              1             2             3
Indian Pines         92.74 ± 0.69  93.61 ± 0.56  92.63 ± 0.84
Salinas              94.06 ± 0.26  95.07 ± 0.23  94.01 ± 0.47
University of Pavia  95.63 ± 0.50  95.97 ± 0.46  95.66 ± 0.59

TABLE IX
TRAINING TIME (IN SECONDS) OF THE PROPOSED NETWORK W.R.T. VARYING DEPTHS (NUMBER OF RESIDUAL LEARNING MODULES).

Dataset              1    2    3
Indian Pines         431  482  549
Salinas              616  696  777
University of Pavia  426  474  544

B. HSI Classification
Table V shows a performance comparison among the proposed network and baselines on the datasets.
Hu et al. [2] report only a single instance of classification performance without indicating
whether the value is the best or the mean accuracy of multiple evaluations. The proposed network
provided improved performance over all the baselines on all datasets. The mean classification
performance of the proposed network is better than the best baseline classification performance
by 2.58 %, 2.47 %, and 2.86 % for the Indian Pines dataset, the Salinas dataset, and the
University of Pavia dataset, respectively. This performance enhancement was achieved mainly by
building a deeper and wider network as well as jointly exploiting the spatio-spectral information
of the hyperspectral data. Residual learning also helped improve the performance by optimizing
training efficiency on a relatively small number of samples. The groundtruth map (left) and the
classification map (right) obtained by the proposed network for all datasets are also shown in
Figure 7. The classification map is drawn from one arbitrary train/test partition among 20.

C. Finding the Optimal Depth and Width of the Network
To find the optimal width of the proposed network, we evaluate the network by varying the number
of convolutional filters (i.e., the number of kernels): 64, 128, 192, and 256 for all three
datasets. Table VI shows the performance of the proposed network with the varying numbers of
kernels (network width) while Table VII shows training time for all cases. For the Indian Pines
dataset and the University of Pavia dataset, 128 is the optimal width for the best performance
while 192 is the best one for the Salinas dataset. Since the Salinas dataset contains more
training samples from a larger number of classes than the other datasets, more weights seem to
be necessary to achieve optimal performance. As shown in
Tables VI and VII, adding more filters to the optimal network not only causes a reduction in
performance but also results in an increase in computational cost.

We also evaluate the proposed network with various depths in order to find the optimal depth.
Depth can be varied by using different numbers of residual learning modules. Performance
comparison of the proposed network with varying numbers of residual learning modules is shown in
Table VIII. Table IX shows training time for all cases. For all the three datasets, using two
residual learning modules achieves the best performance among all variations. Using three
residual learning modules may face an overfitting issue, which results in performance
degradation. Table IX also shows that using three residual learning modules is computationally
more expensive.

On the basis of these evaluations, we choose the network with two residual learning modules and
the width of 128 for each layer for both the Indian Pines dataset and the University of Pavia
dataset. For the Salinas dataset, the network with two residual learning modules and the width of
192 for each layer is selected.

D. Effectiveness of the Multi-scale Filter Bank
To verify the effectiveness of the multi-scale filter bank used to jointly exploit the
spatio-spectral information, we compare the proposed network to the network without the
multi-scale filter bank, which uses only a 1×1 filter in the first layer. We also compare to the
network with the multi-scale filter bank with a different configuration: 1×1, 3×3, 5×5, and 7×7.
Figure 8 shows the architectures of all the multi-scale filter banks. As shown in Table X, the
multi-scale filter bank significantly outperforms the network without it (1×1 only) for all the
three datasets (by 39.94 % for the Indian Pines dataset, 44.45 % for the Salinas dataset, and
30.35 % for the University of Pavia dataset in mean classification performance). The drastic
performance degradation is mainly caused by two reasons: i) no joint exploitation of the
spatio-spectral information is performed and ii) data augmentation by mirroring local regions
cannot be used due to the non-existence of spatial filtering.

We also compare the proposed network to networks whose multi-scale filter banks have different
configurations. As shown in Table X, the performance degradation from using the multi-scale
filter bank with all the filters up to 7×7, denoted by ∼7×7, is caused by 'spillover' near class
boundaries resulting from using the spatial filter of 7×7. Therefore, we choose to use a
multi-scale filter bank with 1×1, 3×3, and 5×5 for the proposed network.

[Figure 8: four filter-bank panels labelled 1×1, ∼3×3, ∼5×5, and ∼7×7. Each panel takes an input of the matching size (1×1, 3×3, 5×5, or 7×7), applies parallel convolutions from 1×1 up to the named size, and produces a 1×1 output; the multi-scale panels also contain max pooling layers and a filter concatenation stage.]

Fig. 8. Architecture of various multi-scale filter banks.

TABLE X
PERFORMANCE COMPARISON OF THE PROPOSED NETWORK (IN PERCENTAGE) W.R.T. MULTI-SCALE FILTER BANKS WITH DIFFERENT CONFIGURATIONS.
∼7×7 MEANS THE MULTI-SCALE FILTER BANK CONSISTING OF 1×1, 3×3, 5×5, AND 7×7 CONVOLUTION FILTERS.

Dataset              1×1            ∼3×3          ∼5×5          ∼7×7
Indian Pines         53.67 ± 16.63  87.37 ± 4.12  93.61 ± 0.56  93.47 ± 0.77
Salinas              50.62 ± 30.87  92.08 ± 0.77  95.07 ± 0.23  94.20 ± 0.43
University of Pavia  65.62 ± 8.18   93.59 ± 1.35  95.97 ± 0.46  95.91 ± 0.50

E. Effectiveness of Residual Learning
To verify the effectiveness of residual learning, we also compare the performance of the
proposed network to a similar network with the first residual module replaced with two regular
convolutional layers, as shown in Table XI. Both networks are built on the same number of
convolutional layers, which is 9. It was found that the network without any residual learning
modules failed to converge in training, mainly due to the small size of the training data. The
network with the first residual learning module replaced with two convolutional layers also
failed to optimize the network parameters, resulting in sub-optimal performance, as shown in
Table XI. Figure 9 shows the comparison of training loss and classification accuracy as a
function of training iterations for the two networks, which are calculated from one arbitrary
train/test partition. From the training loss in the plots of the first row of Figure 9, we
observe that the proposed network achieves lower loss both during learning and at the end of the
iterations than the other network. The second row of Figure 9 also shows that lower loss during
learning leads to improved classification accuracy. These observations support that residual
learning greatly improves overall learning efficiency, resulting in both lower training loss and
higher classification accuracy.

[Figure 9: training loss on a log scale (top row) and classification accuracy (bottom row) versus training iteration (×10^4), each with curves 'w/ residual learning' and 'w/o residual learning', for (a) Indian Pines, (b) Salinas, and (c) University of Pavia.]

Fig. 9. Evaluation of the effectiveness of residual learning. Training loss (top) and classification accuracy (bottom) on three datasets are provided as a
function of training iterations for the proposed network and the network with the first residual learning module replaced with two convolutional layers. Note that
'w/ residual learning' is the proposed architecture and 'w/o residual learning' is the modified architecture, in which the first residual learning module is
replaced with two sequential convolutional layers that use the same nonlinear layers.

TABLE XI
CLASSIFICATION PERFORMANCE COMPARISON OF THE PROPOSED NETWORK AND THE NETWORK WITH THE FIRST RESIDUAL LEARNING
MODULE REPLACED WITH REGULAR CONVOLUTIONAL LAYERS (IN PERCENTAGE).

Dataset              w/ conv. layer  w/ residual learning
Indian Pines         49.73 ± 24.58   93.61 ± 0.56
Salinas              46.75 ± 25.98   95.07 ± 0.23
University of Pavia  50.23 ± 27.78   95.97 ± 0.46

F. Performance Changes according to Training Set Size
To analyze the effects of training dataset size in learning the proposed network, we compare the
performance of the proposed network as the size of the training dataset is changed: 50, 100, 200,
400, or 800 examples per class. Table XII presents classification accuracy of the proposed
network w.r.t. training dataset size. For the Indian Pines dataset, we do not perform learning
with 800 examples per class because several classes have insufficient examples (e.g., 483 for
Grass-pasture, 478 for Hay-windrowed, 593 for Soybean-clean).

As expected, the classification accuracy of the proposed network monotonically increases as
training dataset size increases. We also note that even for smaller training dataset sizes, such
as 50 and 100, the proposed network provides higher accuracy than multiple kernel learning
(MKL)-based HSI classification [20] in all but one case (University of Pavia with 50 examples),
as shown in Table XII.

TABLE XII
PERFORMANCE COMPARISON OF THE PROPOSED NETWORK (IN PERCENTAGE) W.R.T. THE NUMBER OF TRAINING EXAMPLES PER CLASS.

Dataset              Method                50            100           200           400           800
Indian Pines         MKL [20]              77.40 ± 1.78  80.63 ± 0.99  ·             ·             ·
Indian Pines         The proposed network  80.50 ± 3.93  87.39 ± 0.88  93.61 ± 0.56  94.68 ± 0.47  ·
Salinas              MKL [20]              89.33 ± 0.44  90.60 ± 0.43  ·             ·             ·
Salinas              The proposed network  91.36 ± 1.11  93.15 ± 0.43  95.07 ± 0.23  96.55 ± 0.29  97.14 ± 0.53
University of Pavia  MKL [20]              91.52 ± 0.98  92.72 ± 0.33  ·             ·             ·
University of Pavia  The proposed network  91.39 ± 0.80  93.10 ± 0.45  95.97 ± 0.46  96.81 ± 0.25  97.31 ± 0.26
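The two trends read off Table XII, accuracy growing with training set size and the comparison against MKL [20] at 50 and 100 examples per class, can be checked mechanically. The sketch below uses only the mean values from the table; the variable and function names are illustrative assumptions, and `None` marks the configuration that was not run.

```python
# Mean accuracy (%) from Table XII for 50, 100, 200, 400, 800 examples per class.
PROPOSED = {
    "Indian Pines":        [80.50, 87.39, 93.61, 94.68, None],
    "Salinas":             [91.36, 93.15, 95.07, 96.55, 97.14],
    "University of Pavia": [91.39, 93.10, 95.97, 96.81, 97.31],
}
# MKL [20] is reported only for 50 and 100 examples per class.
MKL = {
    "Indian Pines":        [77.40, 80.63],
    "Salinas":             [89.33, 90.60],
    "University of Pavia": [91.52, 92.72],
}

def is_monotonic_increasing(values):
    """True if the reported accuracies strictly increase with training set size."""
    reported = [v for v in values if v is not None]
    return all(a < b for a, b in zip(reported, reported[1:]))

def gap_over_mkl(dataset):
    """Proposed minus MKL mean accuracy at 50 and 100 examples per class."""
    return [round(p - m, 2) for p, m in zip(PROPOSED[dataset], MKL[dataset])]

for name in PROPOSED:
    print(name, is_monotonic_increasing(PROPOSED[name]), gap_over_mkl(name))
```

On these numbers the accuracy is strictly increasing for all three datasets, and the gap over MKL is positive everywhere except University of Pavia with 50 examples (91.39 versus 91.52), where the difference is well within the reported standard deviations.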

G. False Positives Analysis
Table XIII shows confusion matrices for three datasets, which are calculated from one arbitrary
train/test partition. For the Indian Pines dataset, the proposed network presents performance
below 95 % in only two of the eight classes, corn-notill and soybean-mintill. As shown in Table
II, the two classes are the ones with much larger numbers of samples than the others. The network
learning with relatively small training data seems to fail to represent the overall spectral
characteristics of the classes. In addition, approximately 5 % of the samples of each of the two
classes are mislabeled as the other class because the spectral distributions of the two classes
are more widespread than the others. A similar tendency is shown for the Salinas dataset. The
proposed network performed worst for the two classes with more test data, which are grapes
untrained and vineyard untrained, as shown in Table III: 83.4 % for grapes untrained and 89.4 %
for vineyard untrained. Most false positives from each of the two classes are the ones
misclassified as the other class of the two. For the University of Pavia dataset, the
classification performance of the bricks class is noticeably worse, at less than 90 %. Most false
positives of the bricks class are classified as gravel.

To evaluate how the proposed network performs for pixels near boundaries between different
classes, we categorized all the pixels according to the pixel distance to the boundary. Pixels on
the boundary are labelled as zero. Similarly, pixels one pixel away from the boundary are
labelled as one. The rest are labelled as ≥ 2. Note that we use the neighboring 5×5 pixels for
exploiting the spatial information of each pixel. For pixels labelled as ≥ 2, their 5×5
neighboring pixels are from the same class. Table XIV shows the number of false positives versus
all the test data within each pixel category for all the three datasets. For all datasets, it is
observed that larger portions of false positives are generated near boundaries, as expected. The
false positives close to class boundaries are one of the major factors in the performance
degradation of the proposed network. The pixels more than one pixel away from the boundaries are
not affected by 'spillover' and are therefore less prone to misclassification.

V. CONCLUSION
In the proposed work, we have built a fully convolutional neural network with a total of 9
layers, which is much deeper than other existing convolutional networks for HSI classification.
It is well known that a suitably optimized deeper network can in general lead to improved
performance over shallower networks. To enhance the learning efficiency of the proposed network
trained on relatively sparse training samples, a recently introduced learning approach called
residual learning has been used. To leverage both spectral and spatial information embedded in
hyperspectral images, the proposed network jointly exploits local spatio-spectral interactions by
using a multi-scale filter bank at the initial stage of the network. The multi-scale filter bank
consists of three convolutional filters with different sizes: two filters (3×3 and 5×5) are used
to exploit local spatial correlations while 1×1 is used to address spectral correlations.

As supported by the experimental results, the proposed network provided enhanced classification
performance on the three benchmark datasets over current state-of-the-art approaches using
different CNN architectures. The improved performance is mainly from i) using a deeper network
with enhanced training and ii) joint exploitation of spatio-spectral information. The depth (the
number of layers) and width (the number of kernels used in each layer) of the proposed network as
well as the number of residual learning modules are determined by cross validation. The
classification performance also shows that the proposed network with two residual learning
modules outperforms the one with only one module, which supports the effectiveness of the
residual learning incorporated into the proposed network.

REFERENCES
[1] Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu, “Deep learning-based classification of hyperspectral data,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), vol. 7, no. 6, pp. 2094–2107, 2014.
[2] W. Hu, Y. Huang, L. Wei, F. Zhang, and H. Li, “Deep convolutional neural networks for hyperspectral image classification,” Journal of Sensors, vol. 2015.
[3] W. Zhao and S. Du, “Spectral-spatial feature extraction for hyperspectral image classification: A dimension reduction and deep learning approach,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, no. 8, 2016.
[4] Y. Chen, H. Jiang, C. Li, X. Jia, and P. Ghamisi, “Deep feature extraction and classification of hyperspectral images based on convolutional neural networks,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, no. 10, pp. 6232–6251, 2016.
[5] P. Liu, H. Zhang, and K. Eom, “Active deep learning for classification of hyperspectral images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), no. 10, pp. 712–724, 2017.

TABLE XIII
CONFUSION MATRIX. GROUNDTRUTH LABELS AND CLASSIFIED CLASSES ARE GIVEN ALONG x AND y AXES, RESPECTIVELY. THE NUMBERS ALONG THE
AXES CORRESPOND TO THE CLASS NUMBERS IN TABLES II, III, AND IV FOR THE THREE DATASETS, RESPECTIVELY. PER CLASS, THE BEST ACCURACY IS
INDICATED BY BOLD FONT.

1 2 3 4 5 6 7 8
1 90.1 % 1.5 % 0.0 % 0.0 % 1.0 % 4.9 % 2.2 % 0.3 %
2 1.8 % 97.1 % 0.0 % 0.0 % 0.0 % 1.1 % 0.0 % 0.0 %
3 0.0 % 0.0 % 100.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
4 0.0 % 0.0 % 0.0 % 100.0 % 0.0 % 0.0 % 0.0 % 0.0 %
5 1.3 % 0.0 % 0.1 % 0.0 % 95.9 % 2.2 % 0.5 % 0.0 %
6 5.5 % 3.7 % 0.0 % 0.0 % 3.1 % 87.1 % 0.7 % 0.0 %
7 2.0 % 0.8 % 0.0 % 0.0 % 0.0 % 0.8 % 96.4 % 0.0 %
8 0.0 % 0.0 % 0.6 % 0.0 % 0.0 % 0.0 % 0.0 % 99.4 %
(a) Indian Pines

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 100.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
2 0.0 % 100.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
3 0.0 % 0.0 % 100.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
4 0.0 % 0.0 % 0.0 % 99.3 % 0.7 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
5 0.0 % 0.0 % 0.0 % 0.5 % 98.5 % 0.0 % 0.0 % 0.0 % 0.0 % 0.2 % 0.0 % 0.2 % 0.0 % 0.0 % 0.6 % 0.0 %
6 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 100.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
7 0.2 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 99.8 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
8 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 83.4 % 0.0 % 0.9 % 0.0 % 0.0 % 0.0 % 0.3 % 15.5 % 0.0 %
9 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 99.6 % 0.0 % 0.4 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 %
10 0.0 % 0.0 % 1.0 % 0.0 % 0.0 % 0.2 % 0.0 % 0.3 % 0.3 % 94.6 % 1.6 % 1.0 % 0.0 % 0.6 % 0.4 % 0.0 %
11 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 99.3 % 0.7 % 0.0 % 0.0 % 0.0 % 0.0 %
12 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 100.0 % 0.0 % 0.0 % 0.0 % 0.0 %
13 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 100.0 % 0.0 % 0.0 % 0.0 %
14 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 100.0 % 0.0 % 0.0 %
15 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 100.0 % 0.0 %
16 0.0 % 0.0 % 0.0 % 0.1 % 0.1 % 0.0 % 0.5 % 1.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.2 % 0.1 % 98.0 %
(b) Salinas

1 2 3 4 5 6 7 8 9
1 94.6 % 0.0 % 1.2 % 0.0 % 0.0 % 0.0 % 2.8 % 1.2 % 0.0 %
2 0.0 % 96.0 % 0.0 % 1.7 % 0.0 % 2.3 % 0.0 % 0.0 % 0.0 %
3 0.5 % 0.0 % 95.5 % 0.0 % 0.0 % 0.3 % 0.0 % 4.7 % 0.0 %
4 0.0 % 3.1 % 0.0 % 95.9 % 0.0 % 0.9 % 0.0 % 0.0 % 0.0 %
5 0.0 % 0.0 % 0.0 % 0.0 % 100.0 % 0.0 % 0.0 % 0.0 % 0.0 %
6 0.0 % 4.4 % 0.0 % 0.2 % 0.0 % 94.1 % 0.0 % 1.2 % 0.0 %
7 2.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 97.5 % 0.4 % 0.0 %
8 1.7 % 0.1 % 8.9 % 0.0 % 0.0 % 0.5 % 0.0 % 88.8 % 0.0 %
9 0.1 % 0.0 % 0.0 % 0.0 % 0.0 % 0.0 % 0.4 % 0.0 % 99.5 %
(c) University of Pavia
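A percentage confusion matrix of the kind shown in Table XIII can be computed from groundtruth and predicted labels as sketched below. This is a generic illustration, not the authors' evaluation code. It assumes that each row corresponds to one groundtruth class and is normalized by that class's test count, which is motivated by the rows of Table XIII summing to about 100 %; the function names are hypothetical.

```python
import numpy as np

def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with rows indexed by groundtruth class and columns by predicted class."""
    counts = np.zeros((n_classes, n_classes), dtype=np.int64)
    np.add.at(counts, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return counts

def row_percentages(counts):
    """Normalize each row by its total so that every groundtruth class sums to 100 %."""
    totals = counts.sum(axis=1, keepdims=True)
    return 100.0 * counts / np.maximum(totals, 1)

def overall_accuracy_percent(counts):
    """Overall accuracy (OA): correctly classified samples over all samples, in percent."""
    return 100.0 * np.trace(counts) / counts.sum()

# Toy example with two classes (0-based class indices).
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1]
counts = confusion_counts(y_true, y_pred, n_classes=2)
print(row_percentages(counts))
print(overall_accuracy_percent(counts))
```

The diagonal of the row-normalized matrix gives the per-class accuracy discussed in the false positives analysis, while OA weights every test sample equally and is therefore dominated by the larger classes.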

[6] P. Zhong, Z. Gong, S. Li, and C.-B. Schönlieb, “Learning to diversify deep belief networks for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), no. 99, pp. 1–15, 2017.
[7] Y. Chen, X. Zhao, and X. Jia, “Spectral-spatial classification of hyperspectral data based on deep belief network,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), vol. 8, no. 6, pp. 2381–2392, 2015.
[8] T. Li, J. Zhang, and Y. Zhang, “Classification of hyperspectral image based on deep belief networks,” in IEEE Conference on Image Processing (ICIP), 2014.
[9] K. Pearson, “On lines and planes of closest fit to systems of points in space,” Philosophical Magazine, vol. 2, no. 11, pp. 559–572, 1901.
[10] X. Wang, Y. Kong, Y. Gao, and Y. Cheng, “Dimensionality reduction for hyperspectral data based on pairwise constraint discriminative analysis and nonnegative sparse divergence,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), no. 10, pp. 1552–1562, 2017.
[11] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[12] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[13] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2015.

TABLE XIV
CATEGORIZATION OF THE FALSE POSITIVES W.R.T. THE PIXEL DISTANCE TO THE BOUNDARY

                     # of FP / # of test data                 Percentage
Dataset              0           1           ≥2               0         1         ≥2
Indian Pines         93 / 717    80 / 709    310 / 5478       12.97 %   11.28 %   5.66 %
Salinas              94 / 1093   81 / 1082   2688 / 48754     8.60 %    7.49 %    5.51 %
University of Pavia  254 / 3455  299 / 4135  1737 / 33386     7.35 %    7.23 %    4.30 %
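The pixel categories used in Table XIV (0 for pixels on a class boundary, 1 for pixels one pixel away, ≥ 2 for the rest) can be derived from a groundtruth label map as sketched below. The paper does not spell out the neighborhood definition, so this sketch makes several assumptions: a pixel is on the boundary if its 3×3 window contains another label, it is in category 1 if only its 5×5 window does, image borders are not treated as boundaries (edge padding), and unlabeled pixels count as a different label. All function names are hypothetical.

```python
import numpy as np

def window_is_uniform(label_map, radius):
    """True where every pixel in the (2*radius+1)^2 window shares the center label."""
    lab = np.asarray(label_map)
    h, w = lab.shape
    padded = np.pad(lab, radius, mode="edge")  # image borders are not class boundaries
    uniform = np.ones((h, w), dtype=bool)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            shifted = padded[radius + dy: radius + dy + h, radius + dx: radius + dx + w]
            uniform &= shifted == lab
    return uniform

def boundary_distance_category(label_map):
    """Return 0 (on boundary), 1 (one pixel away), or 2 (meaning >= 2) for each pixel."""
    category = np.full(np.asarray(label_map).shape, 2, dtype=int)
    category[~window_is_uniform(label_map, 2)] = 1  # 5x5 window touches another class
    category[~window_is_uniform(label_map, 1)] = 0  # 3x3 window touches another class
    return category

def false_positives_by_category(label_map, pred_map, test_mask):
    """Return {category: (# misclassified test pixels, # test pixels)}."""
    label_map, pred_map = np.asarray(label_map), np.asarray(pred_map)
    category = boundary_distance_category(label_map)
    wrong = (pred_map != label_map) & test_mask
    result = {}
    for c in (0, 1, 2):
        selected = test_mask & (category == c)
        result[c] = (int((wrong & selected).sum()), int(selected.sum()))
    return result
```

Under these assumptions a pixel in category ≥ 2 has a 5×5 neighborhood drawn entirely from its own class, which matches the property stated in the text.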

[14] H. Lee and H. Kwon, “Contextual deep cnn based hyperspectral classification,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2016.
[15] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, pp. 541–551, 1989.
[16] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Conference on Neural Information Processing Systems (NIPS), 2012.
[17] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE conference on Computer Vision and Pattern Recognition (CVPR), 2009.
[18] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
[19] P. Gurram and H. Kwon, “Sparse kernel-based ensemble learning with fully optimized kernel parameters for hyperspectral classification problems,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 51, pp. 787–802, 2013.
[20] Y. Gu, T. Liu, X. Jia, J. A. Benediktsson, and J. Chanussot, “Nonlinear multiple kernel learning with multiple-structure-element extended morphological profiles for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 3235–3247, 2016.
[21] F. de Morsier, M. Borgeaud, V. Gass, J.-P. Thiran, and D. Tuia, “Kernel low-rank and sparse graph for unsupervised and semi-supervised classification of hyperspectral images,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 3410–3420, 2016.
[22] J. Liu, Z. Wu, J. Li, A. Plaza, and Y. Yuan, “Probabilistic-kernel collaborative representation for spatial-spectral hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 2371–2384, 2016.
[23] Q. Wang, Y. Gu, and D. Tuia, “Discriminative multiple kernel learning for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 3912–3927, 2016.
[24] B. Guo, S. R. Gunn, R. I. Damper, and J. D. B. Nelson, “Customizing kernel functions for SVM-based hyperspectral image classification,” IEEE Transactions on Image Processing (TIP), vol. 17, pp. 622–629, 2008.
[25] L. Yang, M. Wang, S. Yang, R. Zhang, and P. Zhang, “Sparse spatio-spectral lapSVM with semisupervised kernel propagation for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), no. 99, pp. 1–9, 2017.
[26] R. Roscher and B. Waske, “Shapelet-based sparse representation for landcover classification of hyperspectral images,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 1623–1634, 2016.
[27] J. Liu and W. Lu, “A probabilistic framework for spectral-spatial classification of hyperspectral images,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 5375–5384, 2016.
[28] A. Zehtabian and H. Ghassemian, “Automatic object-based hyperspectral image classification using complex diffusions and a new distance metric,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 4106–4114, 2016.
[29] S. Jia, J. Hu, Y. Xie, L. Shen, X. Jia, and Q. Li, “Gabor cube selection based multitask joint sparse representation for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 3174–3187, 2016.
[30] J. Xia, J. Chanussot, P. Du, and X. He, “Rotation-based support vector machine ensemble in classification of hyperspectral data with limited training samples,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 1519–1531, 2016.
[31] Z. Zhong, B. Fan, K. Ding, H. Li, S. Xiang, and C. Pan, “Efficient multiple feature fusion with hashing for hyperspectral imagery classification: A comparative study,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 4461–4478, 2016.
[32] J. Xia, L. Bombrun, T. Adali, Y. Berthoumieu, and C. Germain, “Spectral-spatial classification of hyperspectral images using ica and edge-preserving filter via an ensemble strategy,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 4971–4982, 2016.
[33] H. Yang and M. Crawford, “Spectral and spatial proximity-based manifold alignment for multitemporal hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 51–64, 2016.
[34] M. Toksöz and İ. Ulusoy, “Hyperspectral image classification via basic thresholding classifier,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 54, pp. 4039–4051, 2016.
[35] P. Zhong and R. Wang, “Learning conditional random fields for classification of hyperspectral images,” IEEE Transactions on Image Processing (TIP), vol. 19, pp. 1890–1907, 2010.
[36] K. Bernard, Y. Tarabalka, J. Angulo, J. Chanussot, and J. A. Benediktsson, “Spectral-spatial classification of hyperspectral data based on a stochastic minimum spanning forest approach,” IEEE Transactions on Image Processing (TIP), vol. 21, pp. 2008–2021, 2012.
[37] Y. Gao, R. Ji, P. Cui, Q. Dai, and G. Hua, “Hyperspectral image classification through bilayer graph-based learning,” IEEE Transactions on Image Processing (TIP), vol. 23, pp. 2769–2778, 2014.
[38] M. Brell, K. Segl, L. Guanter, and B. Bookhagen, “Hyperspectral and lidar intensity data fusion: A framework for the rigorous correction of illumination, anisotropic effects, and cross calibration,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 55, pp. 2799–2810, 2017.
[39] S. Jia, J. Hu, J. Zhu, X. Jia, and Q. Li, “Three-dimensional local binary patterns for hyperspectral imagery classification,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 55, pp. 2399–2413, 2017.
[40] S. Jia, B. Deng, J. Zhu, and Q. Li, “Superpixel-based multitask learning framework for hyperspectral image classification,” IEEE Transactions on Geoscience and Remote Sensing (TGARS), vol. 55, pp. 2575–2588, 2017.
[41] S. Mei, Q. Bi, J. Ji, J. Hou, and Q. Du, “Hyperspectral image classification by exploring low-rank property in spectral or/and spatial domain,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), no. 99, pp. 1–12, 2017.
[42] H. Su, Y. Cai, and Q. Du, “Firefly-algorithm-inspired framework with band selection and extreme learning machine for hyperspectral image classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS), no. 10, pp. 309–320, 2017.
[43] E. Strobl and S. Visweswaran, “Deep multiple kernel learning,” in IEEE International Conference on Machine Learning and Applications (ICMLA), 2013.
[44] Y. Jia*, E. Shelhamer*, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell, “Caffe: Convolutional architecture for fast feature embedding,” in ACM Multimedia (ACMMM), 2014.

Dr. Hyungtae Lee received the BS degree in electrical engineering and mechanical engineering from Sogang University, Seoul, Korea in 2006, the MS degree from Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Korea in 2008, and the PhD degree from University of Maryland, College Park, MD, USA in 2014. He works as an electrical engineering senior consultant for Booz Allen Hamilton Inc. at U.S. Army Research Laboratory in Adelphi, MD. His current research interests include object, action, event, and pose recognition in computer vision, and machine learning.

Dr. Heesung Kwon received the B.Sc. degree in Electronic Engineering from Sogang University, Seoul, Korea, in 1984, and the MS and Ph.D. degrees in Electrical Engineering from the State University of New York at Buffalo in 1995 and 1999, respectively. From 1983 to 1993, he was with Samsung Electronics Corp., where he worked as a senior research engineer. He was with the U.S. Army Research Laboratory (ARL), Adelphi, MD from 1996 to 2006, working on automatic target detection and hyperspectral signal processing applications. From 2006 to 2007, he was with Johns Hopkins University Applied Physics Laboratory (JHU/APL), working on biological standoff detection problems. Dr. Kwon rejoined ARL in August 2007 as a senior electronics engineer, leading hyperspectral research efforts in the Image Processing Branch. Dr. Kwon is currently Associate Editor of IEEE Trans. on Aerospace and Electronic Systems. He also served as Lead Guest Editor of the Special Issue on Algorithms for Multispectral and Hyperspectral Image Analysis of the Journal of Electrical and Computer Engineering. His current research interests include image/video analytics, human-autonomy interaction, hyperspectral signal processing, machine learning, and statistical learning. He has published over 100 journal articles, book chapters, and conference papers on these topics.
