Classification and identification of unknown network protocols based on CNN and T-SNE
Abstract—With the continuous development of users' demands and network technology, more
and more new network protocols emerge, which poses great challenges to network protocol
classification and identification. To reduce the time and labor cost of network protocol
classification and identification, this paper explores an artificial intelligence method for the
autonomous classification and identification of unknown network protocols. First, the network
traffic was converted into grayscale images and, through transfer learning, a pre-trained
Convolutional Neural Network (CNN) model was used to extract the protocol features, which
reduces the training time and the amount of labeled data needed by the artificial neural
network. Finally, with the improved unsupervised hybrid clustering algorithm
based on T-SNE and K-means, the types and number of protocols were autonomously identified
and the network traffic was classified simultaneously. In this way, we can identify unknown
protocols without prior knowledge and the protocol identification adaptability for big data was
also greatly improved. Experimental results show this method has high accuracy and robustness
in identifying unknown network protocols.
1. INTRODUCTION
With the increasing scale of network communication and the constant change of people’s needs, more
encrypted traffic and private protocols appear on the Internet. The classification and identification of
unknown network protocols can support subsequent protocol reverse engineering and, through clustering
analysis, enable more accurate protocol detection [1]. Research on classification and identification
technology of unknown network protocols can effectively provide technical support for detecting illegal
intrusion, monitoring the traffic flow, analyzing user behavior and eventually ensuring network security.
Protocol identification can be achieved in many ways. The traditional method relies on fixed port
numbers, but it is easily defeated by changing the port number in the system [2]. DPI
(Deep Packet Inspection) is the most commonly used protocol identification technology at present. It
needs to conduct further in-depth inspection on the header, payloads and other information of data packets.
However, it cannot identify unknown protocol types, and its feature database may cause heavy resource
consumption [3]. The method based on association rule mining for unknown protocol identification has
certain limitations. For example, in the case of real-time large-scale network protocol analysis, the
computational complexity is enormous [4]. Machine learning methods have a powerful adaptive and
learning capability, and have developed rapidly in the field of protocol analysis. Generally speaking,
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd 1
2nd International Conference on Electronic Engineering and Informatics IOP Publishing
Journal of Physics: Conference Series 1617 (2020) 012071 doi:10.1088/1742-6596/1617/1/012071
machine learning is mainly divided into unsupervised learning and supervised learning. Unsupervised
learning methods are often used to identify unknown protocols and can mine data features without
category information. Hong et al. [1] proposed an application layer protocol classification and
identification method which combines the traditional DPI technology and clustering methods to adapt to
the number of target clusters, and can efficiently classify and identify unknown application layer
protocols. Peng et al. [5] used mathematical statistics to calculate the K value and the cluster initial center
of the K-means clustering algorithm and realized data clustering. Zhang et al. [6] combined the traditional
AGNES hierarchical clustering algorithm with the features of the bitstream data frames, and proposed a
classification method for protocols with unknown bitstreams. This method can automatically identify the
number of clusters and classify unknown bitstream data frames. However, most protocol identification
methods based on traditional machine learning require manual feature selection as input in advance in
order to further classify and identify protocols.
Supervised learning trains models on labeled data to predict identification results. Deep learning is a
typical supervised learning method, and can convert raw data into representations that machines can learn from. It
autonomously transforms low-level features into complex high-level features for representing the
attributes of input images, in order to learn the inherent rules and the representation levels of sample data.
This end-to-end learning method is free from the complex steps of extracting features in advance and
increases the automation level of the protocol classification and identification. In real-time analysis of
online network traffic and big data volume analysis, such as image and video classification and
identification analysis, this method has achieved good results. Wang et al. [7] first proposed the idea of
treating the bit data of traffic as pixels of an image and applied deep learning to traffic classification and
identification. Based on the similarity between network traffic and images, Zhang et al. [8] directly used
network traffic data as input of CNN to train the classification and identification capability for the model.
Wang et al. [9] were the first to classify malware traffic by applying representation learning
to raw data, and improved the accuracy of classification and identification. Li et al. [10]
proposed a Byte Segment Neural Network (BSNN). This Neural Network does not require a priori
knowledge and can handle both connection-oriented and connectionless protocols simultaneously. Deep
learning has achieved success in protocol classification and identification. However, these methods depend too
much on labeled data, and there is a lack of recognized datasets for protocol classification and
identification. Most researchers adopt raw traffic data captured under their respective experiment network
conditions and the data are always labeled by category through manual methods or DPI tools, which has
low accuracy and complicated steps [11]. In addition, how to use deep learning methods to distinguish
between known and unknown network protocols in data traffic and analyze unknown protocols is still a
problem in research on network protocol classification and identification.
This paper proposes a new classification and identification method for unknown network protocols
that combines the advantages of deep learning and unsupervised learning. It does not rely heavily
on labeled data to train the model, and directly uses a CNN to obtain the features of unknown protocols.
First, a CNN with pre-trained model weights is used to automatically extract the features of
unknown network protocols. Then, the improved T-SNE dimension reduction algorithm reduces the
dimensions of the features and identifies the number of unknown protocols.
Finally, using the distance-based assignment of the K-means algorithm, the traffic data are
classified by unknown protocol.
2. METHODS
Step 2 (data conversion): uniformly convert the hexadecimal data of the payload part into binary
bitstreams to facilitate the subsequent generation of the grayscale images.
Step 3 (image generation): convert the binary bitstreams into grayscale images. The binary value 1
corresponds to gray value 255 (white, the maximum 8-bit gray level), and 0 corresponds to 0 (black).
Because the length of the binary bitstream varies, the bits do not always fill a regular square image
that can be recognized by the CNN. A binary bitstream of insufficient length is therefore padded with
0s at the end. The specific rule is as follows: if $(n-1)^2 < l \le n^2$, then $n^2 - l$ 0s must be
added at the end of the binary bitstream, where $n$ is the edge length of the image in pixels and $l$
is the length of the binary bitstream. Finally, the converted gray values were stored in order in an
$n \times n$ matrix and saved in an image format.
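Steps 2 and 3 can be illustrated with a minimal sketch. It assumes the payload is available as a hexadecimal string, that bits are read most significant bit first, and that white is encoded as 255 (the maximum of an 8-bit image); the function name `payload_hex_to_image` is hypothetical and not taken from the paper.

```python
import math

import numpy as np


def payload_hex_to_image(payload_hex: str) -> np.ndarray:
    """Turn a hexadecimal payload into a square 8-bit grayscale image (sketch)."""
    # Step 2: hexadecimal payload -> binary bitstream (MSB first, an assumption).
    raw = np.frombuffer(bytes.fromhex(payload_hex), dtype=np.uint8)
    bits = np.unpackbits(raw)
    l = bits.size

    # Step 3: smallest edge n satisfying (n - 1)^2 < l <= n^2.
    n = math.isqrt(l - 1) + 1 if l > 0 else 1

    # Pad with n^2 - l zeros at the end of the bitstream.
    padded = np.zeros(n * n, dtype=np.uint8)
    padded[:l] = bits

    # Bit 1 -> 255 (white), bit 0 -> 0 (black); fill the n x n matrix in order.
    return (padded * 255).reshape(n, n)


# Hypothetical 3-byte payload: 24 bits, so n = 5 and one zero is appended.
image = payload_hex_to_image("ff00a5")
```

The resulting array can be written to disk with any image library; the choice of file format does not affect the padding rule.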
the labeled data required for training. Transfer learning can be roughly divided into: instance-based deep
transfer learning, mapping-based deep transfer learning, network-based deep transfer learning, and others
[12]. Keras pre-trained models LeNet, AlexNet, VGG, Inception and ResNet deliver good performance
in network-based deep transfer learning [13]. Keras pre-trained models usually refer to convolutional
neural networks trained on ImageNet, which are generally used for the architecture of vision-related tasks.
The ImageNet dataset used for training contains approximately 1 million images, which can be divided
into 1,000 categories [14].
where $\sigma_i$ takes a different value for each data point $x_i$; the mean square deviation of the
Gaussian centered on data point $x_i$ is usually used as its value.
T-SNE minimizes the KL divergence between the joint probability distribution P in the
high-dimensional space and the joint probability distribution Q in the low-dimensional space.
The cost function can be defined as:
$$C = KL(P\|Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}, \tag{2}$$
where $p_{ij}$ and $q_{ij}$ are the joint probabilities in the high-dimensional space and the
low-dimensional space, respectively. The value of $p_{ij}$ is defined as a symmetric conditional
probability, and the value of $q_{ij}$ is obtained through a T-distribution with DOF
(Degree of Freedom) = 1. The calculation formulas can be defined as:
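$$p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}, \qquad q_{ij} = \frac{\left(1 + \|y_i - y_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|y_k - y_l\|^2\right)^{-1}}, \tag{3}$$
where $N$ is the number of data points and $y_i$ is the low-dimensional point corresponding to $x_i$.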
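The quantities in (2) and (3) can be computed directly with NumPy. The following is a minimal sketch, assuming the conditional probabilities are already available as a row-conditional matrix; the function names and the `eps` guard are illustrative choices, not part of the paper.

```python
import numpy as np


def symmetrize(P_cond: np.ndarray) -> np.ndarray:
    """Joint p_ij = (p_{j|i} + p_{i|j}) / (2N) from a conditional matrix."""
    N = P_cond.shape[0]
    return (P_cond + P_cond.T) / (2.0 * N)


def low_dim_affinities(Y: np.ndarray) -> np.ndarray:
    """Joint q_ij from a Student t-kernel with one degree of freedom."""
    sq_dist = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    kernel = 1.0 / (1.0 + sq_dist)
    np.fill_diagonal(kernel, 0.0)      # the sum in (3) excludes k == l
    return kernel / kernel.sum()


def kl_divergence(P: np.ndarray, Q: np.ndarray, eps: float = 1e-12) -> float:
    """Cost C = KL(P || Q); terms with p_ij == 0 contribute nothing."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))
```

Both joint distributions sum to 1 over all pairs, and the cost is zero exactly when the two distributions coincide.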
T-SNE solves this optimization problem by gradient descent. The gradient of the cost function with
respect to $y_i$ is shown in (4):
$$\frac{\partial C}{\partial y_i} = 4 \sum_j (p_{ij} - q_{ij})(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}, \tag{4}$$
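and the low-dimensional solution is updated iteratively with momentum:
$$\mathcal{Y}^{(t)} = \mathcal{Y}^{(t-1)} - \eta \frac{\partial C}{\partial \mathcal{Y}} + \alpha(t)\left(\mathcal{Y}^{(t-1)} - \mathcal{Y}^{(t-2)}\right), \tag{5}$$
where $\mathcal{Y}^{(t)}$ is the solution at iteration $t$, $\alpha(t)$ is the momentum at iteration $t$,
and $\eta$ is the learning rate.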
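The gradient and the momentum update can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names and the default values of `eta` and `alpha` are assumptions, and the update is written as a descent step that subtracts the gradient of the cost.

```python
import numpy as np


def tsne_gradient(P: np.ndarray, Y: np.ndarray) -> np.ndarray:
    """Gradient of the KL cost with respect to every low-dimensional point y_i."""
    diff = Y[:, None, :] - Y[None, :, :]              # y_i - y_j, shape (N, N, d)
    inv = 1.0 / (1.0 + np.sum(diff ** 2, axis=-1))    # (1 + ||y_i - y_j||^2)^-1
    np.fill_diagonal(inv, 0.0)
    Q = inv / inv.sum()                               # low-dimensional joint q_ij
    return 4.0 * np.sum(((P - Q) * inv)[:, :, None] * diff, axis=1)


def momentum_step(Y, Y_prev, grad, eta=100.0, alpha=0.5):
    """One update: gradient descent step plus a momentum term."""
    return Y - eta * grad + alpha * (Y - Y_prev)
```

When P already equals Q, every term in the sum vanishes and the embedding stops moving, which is the fixed point the iteration aims for.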
[Figure 3 flowchart: high-dimensional feature vector → setting parameters (perplexity Perp, iterations T, learning rate η, momentum α(t)) → calculate conditional probability pi|j according to Equation (1) → calculate high-dimensional joint probability pij according to Equation (3) → … → K-means clustering → get classification results]
Figure 3. Overall flowchart of the dimension reduction identification for high-dimensional feature
In Fig. 3, the dimensions of the high-dimensional feature vectors were first reduced to 50 dimensions
through PCA, and then to 2 dimensions through T-SNE. Finally, the K-means algorithm was used to
realize the classification and identification of the traffic data.
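The PCA, T-SNE and K-means chain can be sketched with scikit-learn. This is a minimal sketch on synthetic feature vectors, not the authors' improved algorithm: the number of clusters is passed in by hand here, whereas the paper identifies it automatically, and the perplexity, seed and synthetic data are assumptions made only for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def reduce_and_cluster(features, n_clusters, seed=0):
    """PCA to (at most) 50 dimensions, T-SNE to 2 dimensions, then K-means."""
    n_pca = min(50, features.shape[0], features.shape[1])
    reduced = PCA(n_components=n_pca, random_state=seed).fit_transform(features)
    embedded = TSNE(n_components=2, perplexity=30.0,
                    random_state=seed).fit_transform(reduced)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(embedded)
    return embedded, labels


# Synthetic stand-in for CNN feature vectors: 3 groups of 60 samples, 256-D.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 256))
features = np.vstack([c + rng.normal(size=(60, 256)) for c in centers])
embedded, labels = reduce_and_cluster(features, n_clusters=3)
```

Running PCA first keeps the pairwise distance computation in T-SNE cheap, which is the motivation for combining the two reductions.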
3. EXPERIMENTS
The experimental dataset in this paper was actual network traffic data captured by Wireshark. We
selected the unencrypted traffic data for testing, which included 12 protocol types, such as the common
application layer protocols HTTP, DNS, SMTP, and FTP, and private application protocols such as OICQ
and WOW. The selected traffic data were saved in the .pcap format, and the pre-processed traffic images
were saved in the .jpg format.
To better analyze the classification and identification performance of the algorithm proposed
in this paper, the following four performance indicators were used in the experimental test: accuracy,
precision, recall and F1 score. Among them, the F1 score is the main indicator; it is the harmonic
mean of precision and recall. An F1 score of 1 indicates the best performance in the test, while 0
indicates the worst.
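The four indicators can be computed from a confusion matrix, as in the following sketch. It assumes rows are actual classes and columns are predicted classes, and that per-class precision, recall and F1 are macro-averaged; the averaging scheme and the example matrix are assumptions for illustration, not values from the paper.

```python
import numpy as np


def macro_metrics(conf):
    """Accuracy and macro-averaged precision, recall and F1 from a confusion matrix."""
    conf = np.asarray(conf, dtype=float)   # rows: actual, columns: predicted
    tp = np.diag(conf)
    predicted = conf.sum(axis=0)           # column sums: predicted count per class
    actual = conf.sum(axis=1)              # row sums: actual count per class
    precision = np.divide(tp, predicted, out=np.zeros_like(tp), where=predicted > 0)
    recall = np.divide(tp, actual, out=np.zeros_like(tp), where=actual > 0)
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall for each class.
    f1 = np.divide(2 * precision * recall, denom, out=np.zeros_like(tp), where=denom > 0)
    return {
        "accuracy": tp.sum() / conf.sum(),
        "precision": precision.mean(),
        "recall": recall.mean(),
        "f1": f1.mean(),
    }


# Hypothetical two-class confusion matrix, used only to exercise the function.
metrics = macro_metrics([[8, 2], [1, 9]])
```

With equally sized classes, macro-averaged recall coincides with accuracy.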
All pre-trained models were tested with the same experimental parameters, including the number of
iterations. We randomly selected 150 traffic images for each of the three protocols DNS, FaceTime, and
HTTP as the test set. Because different payload contents produce images of non-uniform size, we
uniformly reshaped the images to 128 × 128 pixels, and the PCA+T-SNE+K-means
dimension reduction clustering algorithm was used. The average classification and identification results
of the three protocols are shown in Table II.
TABLE II. CLASSIFICATION AND IDENTIFICATION RESULTS OF DIFFERENT PRE-TRAINED MODELS
Pre-trained Model   Accuracy   F1 Score   Precision   Recall
ResNet-50           0.8978     0.8967     0.8966      0.8978
MobileNet           0.8956     0.8948     0.8970      0.8956
Xception            0.8222     0.8160     0.8291      0.8222
Inception V3        0.8067     0.8080     0.8198      0.8067
VGG19               0.7600     0.7579     0.7715      0.7600
VGG16               0.7578     0.7587     0.7599      0.7578
Table II ranks the different pre-trained models from top to bottom according to the accuracy. The
ResNet-50 model obtained the best result in terms of both accuracy and F1 score.
For the DNS, FaceTime and HTTP protocols, we used the ResNet-50 pre-trained model to analyze each
protocol in more detail. The clustering confusion matrix for the three protocols is shown in Fig. 10.
Coordinate labels 1, 2 and 3 correspond to the three protocol types DNS, HTTP and FaceTime. The sum
of each column is the predicted number of instances of the protocol category, and the sum of each row
is the actual number. Among them, DNS has the best classification and identification result, and
a small number of HTTP instances are confused with FaceTime. The classification
and recognition results of each protocol are shown in Table III. The F1 scores of the three protocol types
are all higher than 84%, with that of DNS being the highest, reaching 97.09%. The overall accuracy is
89.78% and the average F1 score is 89.78%.
dimension reduction algorithms are shown in Table IV, and the results are the average classification and
identification results of the three protocols.
TABLE IV. CLASSIFICATION AND IDENTIFICATION RESULTS OF THREE DIMENSION REDUCTION ALGORITHMS
Algorithm    Accuracy   F1 Score   Precision   Recall
T-SNE        0.8978     0.8967     0.8966      0.8978
PCA+T-SNE    0.9        0.899      0.8991      0.9
PCA          0.8867     0.8862     0.8859      0.8867
It can be seen from Table IV that T-SNE performs better than PCA. The main reason is that
PCA is a linear dimension reduction algorithm, which has difficulty capturing complex
polynomial relationships between features, while T-SNE finds the structural relationships in the data by
calculating probability distributions on the neighborhood graph. There was not much
difference between the results of T-SNE and PCA+T-SNE, but combining PCA with T-SNE
reduces the amount and time of computation while preserving the accuracy of the result. Therefore, the
combined dimension reduction algorithm of PCA and T-SNE was adopted in this paper.
Fig. 11 shows the results of the PCA+T-SNE algorithm after reduction and visualization of high-
dimensional protocol features, with red representing DNS, blue HTTP, and green FaceTime.
4. CONCLUSIONS
A classification and identification method for unknown network protocols based on CNN and T-SNE
was proposed in this paper. Through this method, first, the protocol data payload information from the
network traffic was extracted. Then, the payload information was converted into grayscale images, and
the CNN pre-trained model was used to extract features as the basis for protocol classification and
identification. Finally, dimension reduction clustering algorithms based on T-SNE and K-means were
adopted to intelligently cluster the feature vectors to efficiently and accurately realize the classification
and identification of unknown network protocols. This method made full use of the advantage of CNN’s
end-to-end learning. On the basis of ensuring the classification and identification accuracy, it avoided the
complex steps of manually extracting features and reduced the training time of the intelligent algorithm
as well as the amount of labeled data required.
This article is a preliminary exploration of deep metric learning in the identification of unknown
protocols. The protocol feature embeddings in the traffic information are extracted through the neural
network, and protocol clustering and recognition can be realized through these standardized feature
embeddings. The results show that the features extracted by the neural network can indeed represent part
of the information of the protocol and have a certain validity in the identification of unknown protocols. In
the future, we hope to combine the LMNN idea to optimize the CNN feature output process, increasing the
feature similarity of data from the same protocol and widening the differences between data from different
protocols, so as to improve the model's representation capability. We will also do further research on
encrypted traffic, and try to use neural networks to find the potential characteristics of encrypted data.
ACKNOWLEDGMENT
This research was funded by the National Natural Science Foundation of China (61601516).
REFERENCES
[1] Hong Z, Gong Q, Feng W, Li Y. Unknown Application Layer Protocol Identification Based on
Adaptive Clustering. Computer Engineering and Applications. 2020, 56(05): 109-117.
[2] Moore A W, Zuev D. Internet traffic classification using bayesian analysis
techniques[C]//Proceedings of the 2005 ACM SIGMETRICS international conference on
Measurement and modeling of computer systems. 2005: 50-60.
[3] Guo L. Research on Multi-Business Identification Technology Oriented High-Speed Network
Management and Control. Doctor, The PLA Information Engineering University, Zhengzhou,
Henan, China, 2012.
[4] Lou S. Research on Parallel FP-Growth Association Rule. Master, University of Electronic Science
and Technology of China, Chengdu, Sichuan, China, 2016.
[5] Peng D, Xiang L, Li S, Yang C, Qiu Y. Classification of intelligent home protocol under multi-
protocols. Journal of Chongqing University of Posts and Telecommunications (Natural
Science Edition), 2018,30 (03): 321-328.
[6] Zhang F, Zhou H, Zhang J, Liu Y, Zhang C. A protocol classification algorithm based on improved
AGNES. Computer Engineering and Science, 2017,39 (04): 796-803.
[7] Wang Z. The applications of deep learning on traffic identification[J]. BlackHat USA, 2015, 24(11):
1-10.
[8] Zhang L, Liao P, Zhao J, Guo L. A Method of Unknown Protocol Identification Based on
Convolution Neural Network. Microelectronics & Computer, 2018, 35(07): 106-108.
[9] Wang W, Zhu M, Zeng X, et al. Malware traffic classification using convolutional neural network
for representation learning[C]//2017 International Conference on Information Networking
(ICOIN). IEEE, 2017: 712-717.
[10] Li R, Xiao X, Ni S, et al. Byte segment neural network for network traffic classification[C]//2018
IEEE/ACM 26th International Symposium on Quality of Service (IWQoS). IEEE, 2018: 1-10.
[11] Feng W, Hong Z, Wu L, Fu M. Review of network protocol identification techniques. Computer
Applications. 2019, 39: 3604-3614.
[12] Tan C, Sun F, Kong T, et al. A survey on deep transfer learning[C]//International Conference on
Artificial Neural Networks. Springer, Cham, 2018: 270-279.
[13] Yosinski J, Clune J, Bengio Y, Lipson H. How transferable are features in deep neural
networks?[C]//Advances in Neural Information Processing Systems. 2014: 3320-3328.
[14] Keras – Home. Keras Documentation. Available online: https://keras.io/ (accessed on 8 February
2020).
[15] Maaten L, Hinton G. Visualizing data using t-SNE[J]. Journal of machine learning research, 2008,
9(Nov): 2579-2605.
[16] MacQueen J. Some methods for classification and analysis of multivariate
observations[C]//Proceedings of the fifth Berkeley symposium on mathematical statistics and
probability. 1967, 1(14): 281-297.