
IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, VOL. 8, NO. 3, SEPTEMBER 2022

Signal Modulation Classification Based on the Transformer Network

Jingjing Cai, Fengming Gan, Xianghai Cao, and Wei Liu, Senior Member, IEEE

Abstract—In this work, the Transformer Network (TRN) is applied to the automatic modulation classification (AMC) problem for the first time. Different from other deep networks, the TRN can incorporate the global information of each sample sequence and exploit the information that is semantically relevant for classification. To illustrate the performance of the proposed model, it is compared with four other deep models and two traditional methods. Simulation results show that the proposed model has a higher classification accuracy, especially at low signal-to-noise ratios (SNRs), and the number of training parameters of the proposed model is less than those of the other deep models, which makes it more suitable for practical applications.

Index Terms—Automatic modulation classification, transformer network, deep learning.

Manuscript received 30 October 2021; revised 1 March 2022 and 10 May 2022; accepted 15 May 2022. Date of publication 20 May 2022; date of current version 9 September 2022. This work was supported by the National Natural Science Foundation of China (No. 61805189 and 62176199), the Fundamental Research Funds for the Central Universities (No. JB210201), and the Natural Science Basic Research Plan in Shaanxi Province of China (No. 2018JQ6068). The associate editor coordinating the review of this article and approving it for publication was X. Chen. (Corresponding author: Xianghai Cao.) Jingjing Cai and Fengming Gan are with the School of Electronic Engineering, Xidian University, Xi'an 710071, China (e-mail: [email protected]; [email protected]). Xianghai Cao is with the School of Artificial Intelligence, Xidian University, Xi'an 710071, China (e-mail: [email protected]). Wei Liu is with the Department of Electronic and Electrical Engineering, University of Sheffield, Sheffield S1 3JD, U.K. (e-mail: [email protected]). Digital Object Identifier 10.1109/TCCN.2022.3176640

I. INTRODUCTION

MODULATION classification plays a significant role in wireless spectrum monitoring [1], [2]. In the early stage, it usually relied on experts to provide decisions on modulation types based on the parameters of measured signals. However, this approach is not practical in crowded electromagnetic environments due to its slow response. Nowadays, various automatic modulation classification (AMC) techniques have been proposed, which are more suitable for scenarios encountered in modern warfare due to their swift response [3], [4], [5], [6], [7]. There are mainly two categories of approaches for AMC: likelihood based approaches and feature based ones.

For the first category, a judgment criterion is usually constructed from hypothesis testing theory by analyzing the statistical characteristics of signals [8]. The ALRT (Average Likelihood Rate Test) algorithm was proposed in [9], which can distinguish BPSK (Binary Phase Shift Keying) and QPSK (Quadrature Phase Shift Keying) signals by employing a decision-theoretic approach based on Bayesian theory. The same approach was adopted in [10] to identify varieties of analog and digital signals, and a classifier based on the qLLR (quasi-Log-Likelihood Ratio) rule was proposed in [11]. As the computational cost increases substantially in a complex signal environment due to an increasing number of parameters, a classifier based on Bayesian theory was developed for fast modulation classification in [12]. Prior knowledge about the signal model is essential for the likelihood based approaches, which limits their practical applications.

For the second category, feature based approaches mainly rely on discriminative features extracted from the raw signal data. These features mainly include spectrum features [13], cumulants and moments [14], [15], [16], [17], [18], [19], and zero crossing features [3]. However, the above methods may not work well at low SNR, since the derived features may not provide enough discriminating information. To reduce the adverse effects of noise, the wavelet transform was introduced in [20], which requires no prior information about the received signal, such as the signal sampling rate or carrier frequency. By fully considering the noise factor, further methods have been developed, such as the correntropy coefficient based algorithm [21]. However, hand-crafted features may be difficult to apply to new modulation types in non-cooperative scenarios [22].

With the fast development of deep learning techniques [23], [24], [25], [26], [27], the modulation classification problem has reached a new solution, and some algorithms have been proposed in this direction. In [28], the CNN (Convolutional Neural Network) based modulation classification approach achieved significant performance improvement compared with traditional methods, especially for low SNR cases. The deep model in [29] can still achieve a high classification accuracy when the length of the signal is larger than the designed CNN input length. In [30], [31], the RNN (Recurrent Neural Network) was applied to modulation classification with a satisfactory classification performance achieved. As CNNs lack the time sensitivity of RNNs and RNNs lack the lightness of CNNs, some algorithms combining CNNs and RNNs were proposed [22], [32], which outperform the CNN based or RNN based algorithms. The STN (Spatial Transformer Network) can be incorporated into an existing CNN architecture, explicitly allowing spatial manipulation of data within the network [26], and it was also applied to modulation classification to improve performance [33], [34], [35].

Processing a long sequence incurs a high computational complexity for deep learning based networks. The Transformer Network (TRN) was first proposed in [36] and is widely used for natural language processing [37], being well suited to long sequence inputs. The TRN can be seen as an extension of the RNN that removes sequential computation while improving performance [38], [39], [40]. Many efficient TRNs have been proposed, which can process long sequences efficiently and take up a small storage space [41], [42], [43]. TRNs have also been applied to other research areas [44], [45], [46], such as image classification, where their performance is comparable to CNNs [47].

In this paper, the TRN is applied to the modulation classification problem for the first time by constructing a TRN-based model with IQ data input. The TRN has the ability to exploit the correlation of IQ sequences and capture powerful features of the signals, which may lead to improved performance. Simulation results show that the TRN-based model has the best classification accuracy compared with four other deep models, i.e., CNN, LSTM (Long Short-Term Memory), SCRNN (Sequential Convolutional Recurrent Neural Network) and STN, and two traditional methods, i.e., KNN (K-Nearest Neighbor) and SVM (Support Vector Machine), especially at low SNRs, and the number of parameters of the proposed model is less than that of the other deep models.

The paper is organized as follows. In Section II, the TRN-based model is presented, including the motivation and detailed construction of the proposed model. Simulation results are given in Section III, where the proposed model is compared with the other deep models and traditional methods, and the impact of various settings on the proposed model is also studied. Conclusions are drawn in Section IV.

II. THE TRANSFORMER NETWORK BASED MODEL

A. Motivation

The CNN is one of the most popular networks; it was proposed for image classification and has since been widely used in many other classification problems. The TRN was proposed for language translation and has gradually been used in a few other applications, such as image classification. There is a big difference between these two networks: the TRN is based on the attention mechanism, while the CNN is not. This causes some feature differences between the two networks, which are listed below:

(1) Correlations between elements of the input sequence play an important role in the TRN, while they are not exploited directly in the CNN.

(2) The TRN establishes long-distance dependence, so more powerful features can be extracted by incorporating the global information, while the CNN concentrates on local information, and the contextual information of the input cannot be fully exploited for feature capturing.

(3) The global information can be integrated easily by the TRN without stacking any extra layers, and thus fewer parameters are needed. However, the CNN extracts global information from local information by continuously stacking convolutional layers, which dramatically increases the number of parameters of the whole network.

The RNN is also widely used in language translation, but there is a feature difference between it and the TRN. The RNN uses sequential computation, which causes the vanishing gradient problem, while the TRN uses parallel computation, so it can be trained more easily and may be more suitable for practical applications.

The STN is a learnable module, which can be incorporated into an existing model, such as a CNN [26]. The feature differences between the STN and the TRN are as follows:

(1) The STN is usually embedded into a CNN, leading to a greater number of training parameters and higher computational complexity. This problem does not exist for the TRN, as it is an independent network.

(2) The TRN concentrates on temporal relations of the input, while the STN focuses on spatial information of the input. From this point of view, it may be more beneficial to employ the TRN than the STN for input signal processing.

As a result, for signal modulation classification with an IQ sequence input, the TRN may be a more suitable choice, as demonstrated by computer simulations later.

B. Overview of the TRN-Based Model

The architecture of the TRN-based model is shown in Fig. 1. It can be roughly divided into two parts: preprocessing and the TRN. There are mainly three parts in the TRN: the Linear Projection Layer, the Transformer Encoder and the Multi-Layer Perceptron (MLP) Head.

The process of signal modulation classification based on the TRN-based model can be summarized as follows: the IQ signal is transformed into sequences by the preprocessing stage, which are then embedded into linear sequences in the Linear Projection Layer; an additional learnable "classification token" with position embedding is added before being fed into the Transformer Encoder; the output of the Transformer Encoder then serves as the input of the MLP Head, which consists of several fully connected layers and dropout layers; and the output of the MLP Head is the final classification result. Details of the above process are provided in the next three subsections.

These parts play different roles in the process of signal modulation classification. In general, the preprocessing stage processes the IQ data before it is fed into the TRN, the Linear Projection Layer projects the data linearly to retain the most discriminating features of the signal, the Transformer Encoder extracts several more powerful features of the signal, and the MLP Head makes the final decision for modulation classification.
C. IQ Data Preprocessing

The IQ data of the signal cannot be applied directly to the TRN, as the TRN requires several sequences as its input, while the original IQ signal data come as two sequences. As the in-phase and quadrature data are sampled in pairs, they should be combined together in a patch to serve as one of the input sequences of the TRN, which makes the features of the signal clearer. So the long IQ signal sequences are divided into multiple shorter sequences of equal length, and a group of shorter IQ signal sequences is then obtained as the input of the TRN.


Fig. 1. The architecture of the TRN-based model.

The preprocessing steps are shown in Fig. 2 and the detailed steps are listed as follows.

(1) Combine the I and Q signal sequences into one vector x.

The I and Q signal sequences are defined as $\mathbf{x}_I = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{Len}]$ and $\mathbf{x}_Q = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_{Len}]$, respectively, where $Len$ is the length of the original data, and they are combined as

$$\mathbf{x} = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_{Len}, \hat{x}_1, \hat{x}_2, \ldots, \hat{x}_{Len}] \tag{1}$$

(2) Transform x into an N × M matrix R.

Suppose $2Len = N \times M$; the vector x is restructured into an $N \times M$ matrix

$$\mathbf{R} = \begin{bmatrix} \bar{x}_1 & \cdots & \bar{x}_M \\ \hat{x}_1 & \cdots & \hat{x}_M \\ \vdots & \ddots & \vdots \\ \bar{x}_{Len-M+1} & \cdots & \bar{x}_{Len} \\ \hat{x}_{Len-M+1} & \cdots & \hat{x}_{Len} \end{bmatrix} = \begin{bmatrix} r_{1,1} & \cdots & r_{1,M} \\ r_{2,1} & \cdots & r_{2,M} \\ \vdots & \ddots & \vdots \\ r_{N,1} & \cdots & r_{N,M} \end{bmatrix} \tag{2}$$

(3) Divide R into $Z$ patches $\mathbf{R}_{z_1,z_2}$ of size $P \times Q$.

Suppose $Z = NM/(PQ)$, $Z_1 = N/P$ and $Z_2 = M/Q$ are all integers. Then

$$\mathbf{R} = \begin{bmatrix} \mathbf{R}_{1,1} & \cdots & \mathbf{R}_{1,Z_2} \\ \vdots & \ddots & \vdots \\ \mathbf{R}_{Z_1,1} & \cdots & \mathbf{R}_{Z_1,Z_2} \end{bmatrix} \tag{3}$$

with

$$\mathbf{R}_{z_1,z_2} = \begin{bmatrix} r_{(z_1-1)P+1,(z_2-1)Q+1} & \cdots & r_{(z_1-1)P+1,z_2Q} \\ r_{(z_1-1)P+2,(z_2-1)Q+1} & \cdots & r_{(z_1-1)P+2,z_2Q} \\ \vdots & \ddots & \vdots \\ r_{z_1P,(z_2-1)Q+1} & \cdots & r_{z_1P,z_2Q} \end{bmatrix}, \quad z_1 = 1, \ldots, Z_1,\; z_2 = 1, \ldots, Z_2 \tag{4}$$

(4) Transform the patches $\mathbf{R}_{z_1,z_2}$ into $1 \times PQ$ sequences $\hat{\mathbf{r}}_{z_1,z_2}$:

$$\hat{\mathbf{r}}_{z_1,z_2} = \left[\mathbf{R}_{z_1,z_2}(1,:),\; \mathbf{R}_{z_1,z_2}(2,:),\; \ldots,\; \mathbf{R}_{z_1,z_2}(P,:)\right] \tag{5}$$

where $\mathbf{R}_{z_1,z_2}(i,:)$ is the $i$-th row of the patch matrix.

Fig. 2. The process for data preprocessing.

The sequences $\hat{\mathbf{r}}_{z_1,z_2}$ are then used as the input of the TRN.
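As a concrete illustration, the four steps above can be expressed in a few lines of NumPy. This is a minimal sketch, not the authors' code: the function name and the dimension choices (Len = 128, N = 2, M = 128, P = Q = 2, so each patch holds an I segment and its paired Q segment) are assumptions for the example; the actual values used in the paper are those of Table II.

```python
import numpy as np

def preprocess_iq(x_i, x_q, N, M, P, Q):
    """Turn one IQ sample pair into Z = (N/P)*(M/Q) flattened patches (Eqs. (2)-(5))."""
    # (1)-(2) Build R with alternating I and Q rows, so each chunk of M in-phase
    # samples sits directly above its paired quadrature chunk, as in Eq. (2).
    R = np.empty((N, M))
    R[0::2] = x_i.reshape(N // 2, M)
    R[1::2] = x_q.reshape(N // 2, M)
    # (3) Divide R into Z1 x Z2 non-overlapping P x Q patches (Eqs. (3)-(4)).
    Z1, Z2 = N // P, M // Q
    patches = R.reshape(Z1, P, Z2, Q).transpose(0, 2, 1, 3)   # shape (Z1, Z2, P, Q)
    # (4) Flatten each patch row by row into a 1 x PQ sequence (Eq. (5)).
    return patches.reshape(Z1 * Z2, P * Q)

Len = 128                                  # assumed signal length per channel
x_i, x_q = np.random.randn(Len), np.random.randn(Len)
seqs = preprocess_iq(x_i, x_q, N=2, M=128, P=2, Q=2)
print(seqs.shape)                          # (64, 4): Z = 64 sequences of length PQ = 4
```

With P = 2, each flattened patch contains two in-phase samples followed by their two paired quadrature samples, which realizes the pairing of I and Q data described above.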
D. The Linear Projection Layer

The Linear Projection Layer is the first part of the TRN, linking the preprocessed data and the Transformer Encoder. The processing steps are listed as follows [47].

(1) Linearly project the sequences $\hat{\mathbf{r}}_{z_1,z_2}$ and construct the matrix $\dot{\mathbf{R}}$.

Suppose E is an embedding projection matrix with dimension $PQ \times D$, where D is the size of the vector used in the Transformer Encoder. The $1 \times D$ vector $\bar{\mathbf{r}}_{z_1,z_2}$ is defined as

$$\bar{\mathbf{r}}_{z_1,z_2} = \hat{\mathbf{r}}_{z_1,z_2}\mathbf{E} \tag{6}$$


Then, the matrix $\dot{\mathbf{R}}$ is constructed by combining all the sequences $\bar{\mathbf{r}}_{z_1,z_2}$ as

$$\dot{\mathbf{R}} = \left[\bar{\mathbf{r}}_{1,1}; \ldots; \bar{\mathbf{r}}_{1,Z_2}; \ldots; \bar{\mathbf{r}}_{Z_1,1}; \ldots; \bar{\mathbf{r}}_{Z_1,Z_2}\right] = \left[\dot{\mathbf{r}}_1; \ldots; \dot{\mathbf{r}}_{(z_1-1)Z_2+z_2}; \ldots; \dot{\mathbf{r}}_{Z_1Z_2}\right] \tag{7}$$

(2) Construct $\hat{\mathbf{R}}$ by adding the learnable class token vector c to $\dot{\mathbf{R}}$.

The vector $\mathbf{c} = [c_1, c_2, \ldots, c_D]$ is first initialized randomly and then updated during the training process. The matrix $\hat{\mathbf{R}}$ with dimension $(Z_1Z_2 + 1) \times D$ is constructed as

$$\hat{\mathbf{R}} = \left[\mathbf{c}; \dot{\mathbf{R}}\right] \tag{8}$$

(3) Construct the matrix $\breve{\mathbf{R}}$ by adding a learned position encoding matrix $\mathbf{E}_{pos}$ to the sequence $\hat{\mathbf{R}}$.

Suppose $\mathbf{E}_{pos}$ is a $(Z_1Z_2 + 1) \times D$ matrix; then $\breve{\mathbf{R}}$ is given by

$$\breve{\mathbf{R}} = \hat{\mathbf{R}} + \mathbf{E}_{pos} \tag{9}$$

$\breve{\mathbf{R}}$ is the input of the Transformer Encoder.
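The three steps of the Linear Projection Layer reduce to a projection, a row prepend and an addition. Below is a minimal NumPy sketch under assumed shapes (Z = 64 patches of length PQ = 4, D = 64); the randomly initialized E, c and E_pos stand in for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)
Z, PQ, D = 64, 4, 64                   # patch count, patch length, embedding size (assumed)
r_hat = rng.normal(size=(Z, PQ))       # flattened patches from the preprocessing stage

# (1) Linear projection: each 1 x PQ sequence times the PQ x D matrix E (Eq. (6)).
E = rng.normal(size=(PQ, D))
R_dot = r_hat @ E                      # (Z, D); the rows are the vectors r_bar

# (2) Prepend the learnable class token c (Eq. (8)).
c = rng.normal(size=(1, D))            # initialized randomly, updated while training
R_hat = np.vstack([c, R_dot])          # (Z + 1, D)

# (3) Add the learned position encoding matrix E_pos (Eq. (9)).
E_pos = rng.normal(size=(Z + 1, D))
R_breve = R_hat + E_pos                # the input of the Transformer Encoder
print(R_breve.shape)                   # (65, 64)
```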
E. The Transformer Encoder

The Transformer Encoder consists of a stack of L identical layers, and each layer is mainly composed of an MLP block and a Multihead Attention (MHA) block. The MLP is a simple fully connected feed-forward network, while the MHA is more complicated. Layer normalization (LN) is applied with residual connections after every block, which mitigates the vanishing gradient problem caused by the greater depth of the neural network.

The structure of the MHA is shown in Fig. 3. The Scaled Dot-Product attention, called attention for short in the following, is the most important part of the MHA. The attention can be described as mapping a query and a set of key-value pairs to an output and linking different positions of a single sequence, which integrates information across the entire input data.

Fig. 3. Structure of the Multihead Attention.

Denote the number of times the attention operation is performed in the MHA as $H_o$; the attention operation can then be described as follows [36].

(1) Embed the input $\breve{\mathbf{R}}$ of the Transformer Encoder by the $D \times d_{model}$ embedding matrix W:

$$\bar{\mathbf{Q}} = \bar{\mathbf{K}} = \bar{\mathbf{V}} = \breve{\mathbf{R}}\mathbf{W} \tag{10}$$

where $\bar{\mathbf{Q}}$, $\bar{\mathbf{K}}$ and $\bar{\mathbf{V}}$ are the resultant matrices.

(2) Calculate the query matrix $\mathbf{Q}_i$, the key matrix $\mathbf{K}_i$ and the value matrix $\mathbf{V}_i$ separately:

$$\mathbf{Q}_i = \bar{\mathbf{Q}}\mathbf{W}_i^Q, \quad \mathbf{K}_i = \bar{\mathbf{K}}\mathbf{W}_i^K, \quad \mathbf{V}_i = \bar{\mathbf{V}}\mathbf{W}_i^V \tag{11}$$

where $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$ and $\mathbf{W}_i^V$, $i = 1, 2, \ldots, H_o$, are matrices with dimensions $d_{model} \times d_k$, $d_{model} \times d_k$ and $d_{model} \times d_v$, respectively.

(3) Calculate $\mathrm{head}_i$ by the operation $\mathrm{Attention}(\cdot)$:

$$\mathrm{head}_i = \mathrm{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) \tag{12}$$

with

$$\mathrm{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i) = \mathrm{softmax}\!\left(\frac{\mathbf{Q}_i\mathbf{K}_i^T}{\sqrt{d_k}}\right)\mathbf{V}_i \tag{13}$$

(4) Construct the matrix Head by concatenating the vectors $\mathrm{head}_i$ and multiplying by the matrix $\mathbf{W}_0$.

Suppose $\mathbf{W}_0$ is an $H_o d_v \times d_{model}$ matrix; we then have

$$\mathrm{Head} = [\mathrm{head}_1, \ldots, \mathrm{head}_{H_o}]\mathbf{W}_0 \tag{14}$$

(5) Obtain the output $\bar{\mathbf{Z}}$ of the MHA by adding the matrix $\breve{\mathbf{R}}$ to Head:

$$\bar{\mathbf{Z}} = \mathrm{Head} + \breve{\mathbf{R}} \tag{15}$$

The output of the Transformer Encoder is obtained by feeding $\bar{\mathbf{Z}}$ into the MLP block. If there is more than one layer in the Transformer Encoder, the output of each layer serves as the input of the next, following the steps described above. Then, the first row of the final output of the Transformer Encoder is chosen as the input of the MLP Head, and the output of the MLP Head is the final classification result.
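A direct NumPy transcription of Eqs. (10)-(15) is sketched below, with random stand-ins for the learned weight matrices and the common convention $d_k = d_v = d_{model}/H_o$, which is an assumption here rather than a setting stated in the paper. The residual in step (5) additionally assumes $d_{model} = D$ so that the shapes match.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

T, D, d_model, Ho = 65, 64, 64, 16     # sequence length, input width, model size, heads
d_k = d_v = d_model // Ho
R_breve = rng.normal(size=(T, D))      # output of the Linear Projection Layer

# (1) Shared embedding: Q_bar = K_bar = V_bar = R_breve W (Eq. (10)).
W = rng.normal(size=(D, d_model))
Q_bar = K_bar = V_bar = R_breve @ W

heads = []
for _ in range(Ho):
    # (2) Per-head query, key and value matrices (Eq. (11)).
    WQ = rng.normal(size=(d_model, d_k))
    WK = rng.normal(size=(d_model, d_k))
    WV = rng.normal(size=(d_model, d_v))
    Qi, Ki, Vi = Q_bar @ WQ, K_bar @ WK, V_bar @ WV
    # (3) Scaled dot-product attention (Eqs. (12)-(13)).
    heads.append(softmax(Qi @ Ki.T / np.sqrt(d_k)) @ Vi)

# (4) Concatenate the heads and project with W0 (Eq. (14)).
W0 = rng.normal(size=(Ho * d_v, d_model))
Head = np.concatenate(heads, axis=-1) @ W0

# (5) Residual connection with the block input (Eq. (15)).
Z_bar = Head + R_breve                 # then fed into the MLP block
print(Z_bar.shape)                     # (65, 64)
```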
III. SIMULATIONS AND ANALYSES

In this section, the classification performance of the TRN-based model is compared with four other deep models and two traditional methods, namely the CNN-based model, LSTM-based model, SCRNN-based model, STN-based model, KNN and SVM. The three benchmark datasets used for the simulations are introduced first, and the impact of different parameter settings on the TRN-based model is studied to find the best set of values. Then, some implementation details and the classification results of the deep models and traditional methods on the three benchmark datasets are presented.

A. Datasets

Three datasets are used in our simulations: RML2016.04C, RML2016.10a and RML2018.01a, which are generated by GNU Radio [48], [49]. RML2016.10a is an upgraded version of RML2016.04C, which considers more effects of a real electromagnetic environment. RML2018.01a is a more robust dataset with a larger amount of data than the other two datasets.

The parameters of these three datasets are listed in Table I, and in-phase signal samples of the four modulation types are presented in Fig. 4. It can be seen that the modulations QPSK and 8PSK are easily confused, while WBFM and AM-DSB also look similar to each other.
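For reference, the RML2016-family datasets are distributed as Python pickle files keyed by (modulation, SNR) pairs, with each entry holding arrays of 2 × 128 IQ samples. A minimal loading sketch under that assumption (the file name may differ for a particular copy):

```python
import pickle
import numpy as np

# RML2016.10a ships as a dict keyed by (modulation, SNR) tuples; each value is
# an array of shape (n_examples, 2, 128): an I row and a Q row of 128 samples.
with open("RML2016.10a_dict.pkl", "rb") as f:   # assumed file name
    data = pickle.load(f, encoding="latin1")

X, labels = [], []
for (mod, snr), samples in data.items():
    X.append(samples)
    labels += [(mod, snr)] * samples.shape[0]
X = np.vstack(X)
mods = sorted({m for m, _ in labels})
print(X.shape, len(mods))              # e.g., (220000, 2, 128) and 11 modulation classes
```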

Fig. 4. The in-phase signal samples of the four modulation types at 10 dB SNR.

TABLE I. PARAMETERS OF THE BENCHMARK DATASETS.

The categorical cross entropy is adopted as the loss function, which can be written as

$$F_{loss} = -\frac{1}{N_{batch}}\sum_{i=1}^{N_{batch}} \mathbf{y}_i \cdot \log(\hat{\mathbf{y}}_i) \tag{16}$$

where $\mathbf{y}_i$ represents the ground truth in the form of one-hot encoding, $\hat{\mathbf{y}}_i$ is the prediction and $N_{batch}$ is the training batch size, set to 32.
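Equation (16) is the standard mini-batch categorical cross entropy; a direct NumPy transcription for a one-hot target batch might read as follows (a sketch, with a small epsilon added for numerical safety):

```python
import numpy as np

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Eq. (16): batch mean of -y_i . log(y_hat_i) for one-hot rows y_i."""
    return -np.mean(np.sum(y_true * np.log(y_pred + eps), axis=1))

# Two samples, three classes: one confident correct prediction, one less so.
y_true = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
y_pred = np.array([[0.9, 0.05, 0.05], [0.2, 0.7, 0.1]])
print(categorical_cross_entropy(y_true, y_pred))   # about 0.231
```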
At each SNR, there are two sample sets: the training set (90%) and the test set (10%). The Adam optimizer with a learning rate of 0.01 is utilized. All training and prediction are implemented in TensorFlow [50].

B. Comparisons With Deep Models and Traditional Methods

Four deep models and two traditional methods are compared with the TRN-based model for signal modulation classification, and their settings are introduced briefly one by one in the following. For all the deep models, the softmax activation function is used in the last layer, and the Adam optimizer and the categorical cross-entropy loss function are applied.

The first one is the CNN-based model [28]. It contains four layers, where two layers are convolutional layers and the other two are dense layers. Each hidden layer utilizes the Rectified Linear Unit (ReLU) activation function.

The second one is the LSTM-based model [31]. Two parts are included in this model: the first part contains two 128-unit LSTM layers, while the last part is an 11-unit dense layer. The first LSTM layer returns the full sequences, while the second one returns only the last state.

The third one is the SCRNN-based model [22]. It combines the speed and lightness of the CNN with the temporal sensitivity of the RNN, and can be divided into three parts: the first part contains two convolutional layers with 128 filters and ReLU activation functions, the second part consists of two 128-unit LSTM layers with ReLU activation functions, and the last part is a dense layer.

The fourth one is the STN-based model [34], which contains two parts. The first one is the STN, which is composed of a Localisation Network, a Grid Generator and a Sampler. The second is a CNN composed of two convolutional layers, each followed by a max pooling layer, with two dense layers at the final stage.

The fifth one is the SVM [51]. It maps the input data into a high-dimensional feature space, where a linear decision surface is constructed. Different classes are located on different sides of the decision surface.

The last one is the KNN [52], which follows the nearest-neighbor decision rule and assigns an unclassified sample point to the nearest classified set.
Four deep models and two traditional methods are used to The patch size related result is shown in Fig. 5. It can be
compare with the TRN-based model for signal modulation seen that the classification accuracy is the highest when the
classification, and their settings are introduced briefly one by patch size is 4. The classification performance of the TRN is
one in the following. For all the deep models, the softmax sensitive to the length of the sequence, and an inappropriate
activation function is used in the last layer, and the Adam length will cause position embedding and position information


The classification performance of the TRN is sensitive to the length of the sequence, and an inappropriate length will cause confusion in the position embedding and position information, which may lead to unsatisfactory classification performance.

Fig. 5. Classification performance of the TRN-based model with a varying patch size.

For the MLP size, as shown in Fig. 6, the classification accuracy is the highest when the MLP size is 256. When the MLP size is smaller than 256, the generalization ability of the TRN is poorer, which leads to worse classification performance. However, with an MLP size greater than 256, the classification accuracy drops due to the overfitting problem.

Fig. 6. Classification performance of the TRN-based model with a varying MLP size.

The layer number and attention number related results are given in Fig. 7(a) and Fig. 7(b). It can be seen that the best result is achieved when the layer number is 5 and the attention number is 16. The depth of the proposed model increases as the number of layers and attentions increases, and the attention distance increases as the depth of the proposed model increases. The attention distance of the TRN is similar to the size of the receptive field of the CNN, and the larger the attention distance, the stronger its ability to extract features. However, if these parameters are too large, the classification accuracy may drop due to the overfitting problem.

Fig. 7. Classification performance of the TRN-based model with a varying (a) layer number, (b) attention number.

The batch size result in Fig. 8 indicates that the classification performance is the best when the batch size is 32. Normally, the greater the batch size, the better the performance, but the performance will degrade when the batch size exceeds a threshold value [53].

Fig. 8. Classification performance of the TRN-based model with a varying batch size.

The impact of the training epoch is demonstrated in Fig. 9, where it can be seen that epoch 15 provides the best performance. As the loss is reduced after each epoch, if the number of epochs is too small, it may lead to unsatisfactory classification performance. Furthermore, the model will converge once the number of epochs exceeds a threshold value.

Splitting the dataset is also an important step before training. The test set size directly affects the classification accuracy of the model, and a proper test set size promotes the model training. The impact of the test set size is shown in Fig. 10, and it can be seen that the best performance corresponds to a test set size of 10%. If the test set size is too small and the training set size is too big, the model may suffer from the overfitting problem; otherwise, it may lead to the underfitting problem.

The optimizer plays an important role in the training process; an appropriate optimizer makes the network converge quickly.


There are some popular optimizers, such as Adam, SGD and AdaGrad. The classification accuracies with different optimizers are provided in Fig. 11, and it can be seen that the Adam optimizer is the best choice.

Fig. 9. Classification performance of the TRN-based model with a varying training epoch.

Fig. 10. Classification performance of the TRN-based model with a varying test set size.

Fig. 11. Classification performance of the TRN-based model with varying optimizers.

D. Performance Comparisons

The data with SNR ranging from −16 dB to 10 dB is chosen from the datasets RML2016.04C, RML2016.10a and RML2018.01a, separately. The settings of the TRN-based model are provided in Table II. The parameters of the four compared deep models and the two traditional methods are all set for their best performance. The classification results are presented in Figs. 12, 13 and 14.

TABLE II. PARAMETERS FOR THE TRN-BASED MODEL.

The classification performance of the deep models and traditional methods on the first dataset is presented in Fig. 12. It can be seen that the proposed model performs the best over the whole SNR range. Compared with the CNN-based and STN-based models, the proposed one outperforms them by nearly 15% and 10%, respectively, at lower SNRs from −16 dB to −12 dB. The STN-based model outperforms the CNN-based one, which implies that making the model spatially invariant enhances its robustness against various adverse effects. Furthermore, the proposed model performs better than the LSTM-based and SCRNN-based models, implying that more powerful features can be extracted by the TRN-based model.

Fig. 12. Classification performance of the deep models and the traditional methods based on RML2016.04C.


Besides, the classification performance of the STN-based and CNN-based models is close to that of the TRN-based one for SNRs ranging from 0 dB to 10 dB, but there is a gap of about 5% and 6% between the SCRNN-based and LSTM-based models and the TRN-based model, respectively. Compared with the traditional method SVM, the proposed model outperforms it by nearly 10% at lower SNRs and 5% at higher SNRs, while the KNN has the worst performance.

The classification performance on the second dataset is presented in Fig. 13. It can be seen that the proposed model still performs the best. The TRN-based model outperforms the STN-based and CNN-based ones by nearly 2% and 4%, respectively, over the whole SNR range. The STN-based model still has a better classification performance than the CNN-based model. Furthermore, the proposed model outperforms the LSTM-based and SCRNN-based models by nearly 2% at lower SNRs and 10% at higher SNRs, and the performance of the TRN-based model is better than that of the SVM by nearly 5% at lower SNRs and 15% at higher SNRs. The KNN again has the worst performance over the whole SNR range. Generally, the classification performance on RML2016.10a drops a lot compared to that on RML2016.04C. As the signals of the dataset RML2016.10a are affected by several different electromagnetic environments, while the signals of RML2016.04C are generated in a single electromagnetic environment, training on RML2016.10a is harder than on RML2016.04C, leading to performance deterioration in the former case [54].

Fig. 13. Classification performance of the deep models and the traditional methods based on RML2016.10a.

The classification performance on the last dataset is shown in Fig. 14. It can be seen that only the TRN-based, SCRNN-based and STN-based models maintain a good performance, while the performance of the other two deep models and the two traditional methods drops a lot compared to that on the first and second datasets. The TRN-based model still performs the best over the whole SNR range, outperforming the SCRNN-based and STN-based models by nearly 1% at lower SNRs and performing almost the same at higher SNRs. The CNN-based model tends to overfit the training set due to its simple network structure and the huge number of samples in the new dataset. For the SVM, both the total number of samples and the dimension of the signal increase, leading to a huge amount of computation in training, and its performance gets worse.

Fig. 14. Classification performance of the deep models and the traditional methods based on RML2018.01a.

It can be seen from Figs. 12, 13 and 14 that the performance of the proposed TRN-based model is similar to that of the STN-based one in most cases on the relevant datasets, and the TRN-based model is superior to the STN-based one for SNRs equal to or lower than −10 dB in Fig. 12.

Moreover, as shown in Table III, the number of training parameters of the TRN-based model is smaller than those of the other deep models. There are fewer training parameters in every layer of the LSTM-based model, so its total number of training parameters is small. For the TRN-based model, only a few layers are needed to extract the features due to the attention mechanism, so the number of training parameters of the proposed model is also small. Although the LSTM-based and TRN-based models both have a low number of training parameters, the TRN-based model outperforms the LSTM-based model.

TABLE III. COMPARISON OF TRAINING PARAMETERS BETWEEN FOUR DEEP MODELS AND THE PROPOSED MODEL.

In deep learning, the confusion matrix is a visual tool to compare the predicted results with the true values. All the classification results for all the classes are displayed in a confusion matrix, where each column represents the predicted modulation class and each row represents the real modulation class. The numerical value in each grid cell denotes the prediction probability of the corresponding modulation class. Confusion matrices of the optimal TRN-based model at various SNRs are presented in Fig. 15. It can be seen that the diagonals become sharper with increasing SNR, which illustrates that the higher the SNR, the better the classification accuracy. However, confusion between 8PSK and QPSK always exists, even at high SNRs. The phases of QPSK are a subset of those of 8PSK, so the curves of their signals are sometimes identical. When the signals are under non-ideal channel conditions, it becomes harder to distinguish the modulation features of 8PSK and QPSK.

Fig. 15. Confusion matrices of the optimized TRN-based model on the dataset RML2016.04C at different SNRs.
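A confusion matrix of the kind shown in Fig. 15 can be assembled directly from the true and predicted class indices. A minimal NumPy sketch (rows are true classes, columns are predicted classes, with each row normalized to prediction probabilities as described above):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """Row = true class, column = predicted class; rows sum to 1."""
    cm = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm / np.maximum(cm.sum(axis=1, keepdims=True), 1)

# Toy example with three classes; class 0 is confused with class 1 half the time.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(confusion_matrix(y_true, y_pred, 3))
```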
IV. CONCLUSION

In this paper, the TRN-based model has been constructed and applied successfully to the automatic modulation classification problem for the first time. As demonstrated by simulation results, the proposed model outperforms the other three deep models (LSTM, CNN and SCRNN) and the two traditional methods (KNN and SVM) in terms of classification accuracy, and the number of training parameters of the proposed model is less than those of the other deep models. In comparison with the STN, the proposed model performs at least as well on most of the relevant datasets, but it is characterized by a smaller number of training parameters and a lower overall computational complexity. The impact of different parameter settings on performance was also studied to find the best configuration for the proposed model. Possible future work can focus on applying various improved TRNs to modulation classification, which may further improve the performance. The signals used in the current simulations are under a single type of channel condition, so it is essential to carry out more simulations under different types of channel conditions to obtain a better training result. Furthermore, the robustness of the model may degrade due to a lack of essential labels, and thus semi-supervised or unsupervised methods could be further investigated.



REFERENCES

[1] T. Yucek and H. Arslan, "A survey of spectrum sensing algorithms for cognitive radio applications," IEEE Commun. Surveys Tuts., vol. 11, no. 1, pp. 116–130, 1st Quart., 2009.
[2] M. Höyhtyä et al., "Spectrum occupancy measurements: A survey and use of interference maps," IEEE Commun. Surveys Tuts., vol. 18, no. 4, pp. 2386–2414, 4th Quart., 2016.
[3] S.-Z. Hsue and S. S. Soliman, "Automatic modulation recognition of digitally modulated signals," in Proc. IEEE Mil. Commun. Conf. Bridging Gap Interoperab. Survivability Security, 1989, pp. 645–649.
[4] C. Clancy, J. Hecker, E. Stuntebeck, and T. O'Shea, "Applications of machine learning to cognitive radio networks," IEEE Wireless Commun., vol. 14, no. 4, pp. 47–52, Aug. 2007.
[5] C. Weber, M. Peter, and T. Felhauer, "Automatic modulation classification technique for radio monitoring," Electron. Lett., vol. 51, no. 10, pp. 794–796, 2015.
[6] P. Qi, X. Zhou, S. Zheng, and Z. Li, "Automatic modulation classification based on deep residual networks with multimodal information," IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 1, pp. 21–33, Mar. 2021.
[7] L. Huang, Y. Zhang, W. Pan, J. Chen, L. P. Qian, and Y. Wu, "Visualizing deep learning-based radio modulation classifier," IEEE Trans. Cogn. Commun. Netw., vol. 7, no. 1, pp. 47–58, Mar. 2021.
[8] W. Wei and J. M. Mendel, "Maximum-likelihood classification for digital amplitude-phase modulations," IEEE Trans. Commun., vol. 48, no. 2, pp. 189–193, Feb. 2000.
[9] K. Kim and A. Polydoros, "Digital modulation classification: The BPSK versus QPSK case," in Proc. 21st Century Mil. Commun. What's Possible Conf. Rec. Mil. Commun. Conf., 1988, pp. 431–436.
[10] A. K. Nandi and E. E. Azzouz, "Algorithms for automatic modulation recognition of communication signals," IEEE Trans. Commun., vol. 46, no. 4, pp. 431–436, Apr. 1998.
[11] A. Polydoros and K. Kim, "On the detection and classification of quadrature digital modulations in broad-band noise," IEEE Trans. Commun., vol. 38, no. 8, pp. 1199–1211, Aug. 1990.
[12] M. L. D. Wong, S. K. Ting, and A. K. Nandi, "Naïve Bayes classification of adaptive broadband wireless modulation schemes with higher order cumulants," in Proc. 2nd Int. Conf. Signal Process. Commun. Syst., 2008, pp. 1–5.
[13] M. P. DeSimio and G. E. Prescott, "Adaptive generation of decision functions for classification of digitally modulated signals," in Proc. IEEE Nat. Aerosp. Electron. Conf., 1988, pp. 1010–1014.
[14] H.-C. Wu, M. Saquib, and Z. Yun, "Novel automatic modulation classification using cumulant features for communications via multipath channels," IEEE Trans. Wireless Commun., vol. 7, no. 8, pp. 3098–3105, Aug. 2008.
[15] J. Lopatka and M. Pedzisz, "Automatic modulation classification using statistical moments and a fuzzy classifier," in Proc. WCC ICSP 5th Int. Conf. Signal Process. 16th World Comput. Congr., vol. 3, 2000, pp. 1500–1506.
[16] A. Swami and B. M. Sadler, "Hierarchical digital modulation classification using cumulants," IEEE Trans. Commun., vol. 48, no. 3, pp. 416–429, Mar. 2000.
[17] M. W. Aslam, Z. Zhu, and A. K. Nandi, "Automatic modulation classification using combination of genetic programming and KNN," IEEE Trans. Wireless Commun., vol. 11, no. 8, pp. 2742–2750, Aug. 2012.
[18] O. A. Dobre, Y. Bar-Ness, and W. Su, "Higher-order cyclic cumulants for high order modulation classification," in Proc. IEEE Mil. Commun. Conf. (MILCOM), vol. 1, 2003, pp. 112–117.
[19] V. D. Orlic and M. L. Dukic, "Automatic modulation classification algorithm using higher-order cumulants under real-world channel conditions," IEEE Commun. Lett., vol. 13, no. 12, pp. 917–919, Dec. 2009.
[20] K. Maliatsos, S. Vassaki, and P. Constantinou, "Interclass and intraclass modulation recognition using the wavelet transform," in Proc. IEEE 18th Int. Symp. Pers. Indoor Mobile Radio Commun., 2007, pp. 1–5.
[21] A. I. Fontes, L. A. Pasa, V. A. de Sousa, Jr., F. M. Abinader, Jr., J. A. Costa, and L. F. Q. Silveira, "Automatic modulation classification using information theoretic similarity measures," in Proc. IEEE Veh. Technol. Conf. (VTC Fall), 2012, pp. 1–5.
[22] K. Liao, Y. Zhao, J. Gu, Y. Zhang, and Y. Zhong, "Sequential convolutional recurrent neural networks for fast automatic modulation classification," IEEE Access, vol. 9, pp. 27182–27188, 2021.


[23] G. E. Hinton, S. Osindero, and Y.-W. Teh, "A fast learning algorithm for deep belief nets," Neural Comput., vol. 18, no. 7, pp. 1527–1554, 2006.
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, vol. 25. Red Hook, NY, USA: Curran, 2012, pp. 1097–1105.
[25] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," 2015, arXiv:1508.04025.
[26] M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu, "Spatial transformer networks," in Advances in Neural Information Processing Systems, vol. 28. Red Hook, NY, USA: Curran, 2015, pp. 2017–2025.
[27] D. Amodei et al., "Deep speech 2: End-to-end speech recognition in English and mandarin," in Proc. Int. Conf. Mach. Learn., 2016, pp. 173–182.
[28] T. J. O'Shea, J. Corgan, and T. C. Clancy, "Convolutional radio modulation recognition networks," in Proc. Int. Conf. Eng. Appl. Neural Netw., 2016, pp. 213–226.
[29] S. Zheng, P. Qi, S. Chen, and X. Yang, "Fusion methods for CNN-based automatic modulation classification," IEEE Access, vol. 7, pp. 66496–66504, 2019.
[30] D. Hong, Z. Zhang, and X. Xu, "Automatic modulation classification using recurrent neural networks," in Proc. 3rd IEEE Int. Conf. Comput. Commun. (ICCC), 2017, pp. 695–700.
[31] S. Rajendran, W. Meert, D. Giustiniano, V. Lenders, and S. Pollin, "Deep learning models for wireless signal classification with distributed low-cost spectrum sensors," IEEE Trans. Cogn. Commun. Netw., vol. 4, no. 3, pp. 433–445, Sep. 2018.
[32] N. E. West and T. O'Shea, "Deep architectures for modulation recognition," in Proc. IEEE Int. Symp. Dyn. Spectr. Access Netw. (DySPAN), 2017, pp. 1–6.
[33] T. J. O'Shea, L. Pemula, D. Batra, and T. C. Clancy, "Radio transformer networks: Attention models for learning to synchronize in wireless systems," in Proc. 50th Asilomar Conf. Signals Syst. Comput., 2016, pp. 662–666.
[34] M. Mirmohammadsadeghi, S. S. Hanna, and D. Cabric, "Modulation classification using convolutional neural networks and spatial transformer networks," in Proc. 51st Asilomar Conf. Signals, Syst., Comput., 2017, pp. 936–939.
[35] M. Li, O. Li, G. Liu, and C. Zhang, "An automatic modulation recognition method with low parameter estimation dependence based on spatial transformer networks," Appl. Sci., vol. 9, no. 5, p. 1010, 2019.
[36] A. Vaswani et al., "Attention is all you need," in Advances in Neural Information Processing Systems. Red Hook, NY, USA: Curran, 2017, pp. 5998–6008.
[37] K. Ahmed, N. S. Keskar, and R. Socher, "Weighted transformer network for machine translation," 2017, arXiv:1711.02132.
[38] T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur, "Recurrent neural network based language model," in Proc. Interspeech, vol. 2, 2010, pp. 1045–1048.
[39] T. Mikolov, S. Kombrink, L. Burget, J. Černockỳ, and S. Khudanpur, "Extensions of recurrent neural network language model," in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2011, pp. 5528–5531.
[40] I. Sutskever, J. Martens, and G. E. Hinton, "Generating text with recurrent neural networks," in Proc. ICML, 2011, pp. 1017–1024.
[41] N. Kitaev, Ł. Kaiser, and A. Levskaya, "Reformer: The efficient transformer," 2020, arXiv:2001.04451.
[42] J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, "Compressive transformers for long-range sequence modelling," 2019, arXiv:1911.05507.
[43] Z. Wang, Y. Ma, Z. Liu, and J. Tang, "R-transformer: Recurrent neural network enhanced transformer," 2019, arXiv:1907.05572.
[44] X. Chu et al., "Twins: Revisiting the design of spatial attention in vision transformers," 2021, arXiv:2104.13840.
[45] Y. Sha, Y. Zhang, X. Ji, and L. Hu, "Transformer-Unet: Raw image processing with Unet," 2021, arXiv:2109.08417.
[46] Y. Wang, X. Zhang, T. Yang, and J. Sun, "Anchor DETR: Query design for transformer-based object detection," 2021, arXiv:2109.07107.
[47] A. Dosovitskiy et al., "An image is worth 16 x 16 words: Transformers for image recognition at scale," 2020, arXiv:2010.11929.
[48] T. J. O'Shea and N. West, "Radio machine learning dataset generation with GNU radio," in Proc. GNU Radio Conf., vol. 1, 2016, pp. 1–6.
[49] T. J. O'Shea, T. Roy, and T. C. Clancy, "Over-the-air deep learning based radio signal classification," IEEE J. Sel. Topics Signal Process., vol. 12, no. 1, pp. 168–179, Feb. 2018.
[50] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. 12th USENIX Symp. Oper. Syst. Des. Implement. (OSDI), 2016, pp. 265–283.
[51] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273–297, 1995.
[52] T. Cover and P. Hart, "Nearest neighbor pattern classification," IEEE Trans. Inf. Theory, vol. 13, no. 1, pp. 21–27, Jan. 1967.
[53] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, "On large-batch training for deep learning: Generalization gap and sharp minima," 2016, arXiv:1609.04836.
[54] E. Perenda, S. Rajendran, G. Bovet, S. Pollin, and M. Zheleva, "Learning the unknown: Improving modulation classification performance in unseen scenarios," in Proc. IEEE Infocom, 2020, pp. 1–10.

