TOR Anonymous Traffic Fingerprint Extraction and R
Tao Pu*, Juan Wang
*School of Cyberspace Security, Chengdu University of Information Technology
1. Introduction
Network information security has become a priority for many countries. With the rapid development
of the Internet, communication security has become a prominent concern, as criminals steal users'
private information for illegal purposes. Users have therefore sought more reliable communication
tools, and anonymous communication systems such as TOR (The Onion Router) and SSH (Secure Shell)
have emerged. Due to SSH copyright issues, TOR is more widely used than SSH.
The TOR anonymous network was proposed by Roger [1] and others. It is a link-based, low-latency
anonymous network. Compared with other anonymous communication networks, TOR offers higher
availability, more convenient deployment, and better security. With simple deployment it can
provide users with efficient, reliable, flexible and secure services.
This paper proposes a new anonymous traffic identification method [2-3], a traffic analysis
technique based on neural network self-learning. It uses passive traffic analysis and comparison
to identify the real address information of the communication terminal of the regulated party.
Compared with traditional identification methods, the innovation of this study is that a neural
network automatically extracts the features of anonymous traffic data, which reduces, to a certain
extent, problems such as biased experimental conclusions caused by manual feature extraction.
The method is also compared with the Naive Bayes and Random Forest methods.
Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution
of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
Published under licence by IOP Publishing Ltd
ICCBDAI 2020 IOP Publishing
Journal of Physics: Conference Series 1757 (2021) 012051 doi:10.1088/1742-6596/1757/1/012051
2. Related technology
$$x_{1:p} = [x_1, x_2, \ldots, x_p] \tag{1}$$
Here, $b \in \mathbb{R}$ is the bias term and $f$ is the activation function. Commonly used
activation functions include ReLU, sigmoid and tanh. The feature map $c$ obtained by sliding a
convolution kernel of width $h$ over $x_{1:p}$ is:
$$c = [c_1, c_2, \ldots, c_{p-h+1}] \tag{3}$$
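For reference, a standard one-dimensional convolution that is consistent with these definitions can be written as follows. This is a sketch under assumptions: the kernel symbol $w$ and the window notation $x_{i:i+h-1}$ are introduced here for illustration and are not taken from the text above.

$$c_i = f\left(w \cdot x_{i:i+h-1} + b\right), \qquad w \in \mathbb{R}^{h}, \quad i = 1, 2, \ldots, p-h+1$$

Each $c_i$ is one entry of the feature map $c$: the kernel covers $h$ consecutive inputs, so a sequence of length $p$ yields $p-h+1$ entries.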
In order to extract more high-dimensional data features, multiple different convolution kernels are
usually used to extract features from $x_{1:p}$, and all feature maps are stacked and fed into the
next layer of the neural network.
This paper exploits the way one-dimensional convolutional neural networks process sequential data.
Network traffic is essentially a sequence of data that can be regarded as discrete values in a
time series. A one-dimensional convolutional neural network extracts features from this sequence
to form a fingerprint that identifies TOR traffic.
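As an illustration of how several convolution kernels produce stacked feature maps from a one-dimensional sequence, the following minimal NumPy sketch may help. The function name `conv1d_feature_maps`, the ReLU activation, and the toy kernel values are assumptions chosen for the example, not the configuration used in this paper.

```python
import numpy as np

def conv1d_feature_maps(x, kernels, biases):
    """Slide each kernel over x and apply ReLU.

    Returns an array of shape (num_kernels, p - h + 1), one feature map per kernel.
    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    kernels = np.asarray(kernels, dtype=float)
    h = kernels.shape[1]  # kernel width
    # windows[i] is the slice x[i : i + h]; there are p - h + 1 such windows.
    windows = np.lib.stride_tricks.sliding_window_view(x, h)
    # Dot every window with every kernel, add the bias, then apply ReLU.
    pre_activation = windows @ kernels.T + np.asarray(biases, dtype=float)
    return np.maximum(pre_activation, 0.0).T

# Toy input sequence (assumed values) and two assumed kernels of width 3.
x = np.array([0.0, 0.5, 1.0, 0.5, 0.0, 0.25])
kernels = np.array([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]])
biases = np.array([0.0, 0.0])
feature_maps = conv1d_feature_maps(x, kernels, biases)
print(feature_maps.shape)
```

With an input of length 6 and kernels of width 3, each feature map has 6 - 3 + 1 = 4 entries, and the two maps are stacked along the first axis.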
$$\mathrm{FLOW} = \frac{1}{255}\left\{[a_1, a_2, \ldots, a_m]_1, [a_1, a_2, \ldots, a_m]_2, \ldots, [a_1, a_2, \ldots, a_m]_k\right\}, \quad a \in [0, 255] \tag{4}$$
Here, $k$ is the number of sessions and $m$ is the number of bytes in each session.
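The normalization in Equation (4) can be sketched as follows. The helper name `sessions_to_flow` is hypothetical, and truncating or zero-padding each session to exactly m bytes is an assumption made so that all rows have the same length; the text above does not state how sessions of different sizes are handled.

```python
import numpy as np

def sessions_to_flow(sessions, m):
    """Turn k raw sessions into a (k, m) array scaled to [0, 1].

    Assumption: sessions longer than m bytes are truncated and shorter
    ones are zero-padded. Hypothetical helper for illustration only.
    """
    flow = np.zeros((len(sessions), m), dtype=np.float32)
    for row, payload in enumerate(sessions):
        # Interpret the first m bytes as unsigned integers in [0, 255].
        data = np.frombuffer(payload[:m], dtype=np.uint8)
        flow[row, :len(data)] = data
    # Divide by 255 so every value lies in [0, 1].
    return flow / 255.0

# Two toy sessions (assumed bytes), k = 2 and m = 4.
flow = sessions_to_flow([b"\x00\xff\x80", b"\x10"], m=4)
print(flow)
```

Scaling byte values into [0, 1] keeps the network inputs in a small, uniform range, which is the usual motivation for this kind of normalization.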
Convert the data stream into numerical form as the neural network input.
First, the input enters the convolutional layer of the neural network, in which cells are
distributed. The convolution kernel establishes the mapping between the input and the features.
Because the same kernel is applied in different cells, the convolution operation learns the same
features at different positions of the input, forming a series of high-dimensional feature sets;
Then, the features enter the pooling layer. The pooling process combines the features learned by
the convolutional layer into higher-level features, reducing the overall number of features;
Finally, the pooled features are passed to the fully connected layer at the end of the neural
network, which performs the feature classification [5].
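The steps above can be sketched as a single forward pass. This is a minimal NumPy illustration under stated assumptions, not the model used in this paper: the function name `forward`, max pooling with a window of 2, ReLU activation, a softmax output over two classes, and the random weights are all chosen for the example.

```python
import numpy as np

def forward(x, kernels, conv_bias, dense_w, dense_b, pool=2):
    """Convolution -> pooling -> fully connected layer (hypothetical sketch)."""
    # Convolutional layer: every kernel slides over the input; ReLU activation.
    h = kernels.shape[1]
    windows = np.lib.stride_tricks.sliding_window_view(x, h)
    maps = np.maximum(windows @ kernels.T + conv_bias, 0.0)  # (p - h + 1, n_kernels)
    # Pooling layer: non-overlapping max pooling reduces the number of features.
    usable = (maps.shape[0] // pool) * pool
    pooled = maps[:usable].reshape(-1, pool, maps.shape[1]).max(axis=1)
    # Fully connected layer followed by softmax gives class probabilities.
    logits = pooled.reshape(-1) @ dense_w + dense_b
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

rng = np.random.default_rng(0)
x = rng.random(16)                    # one normalized input of 16 values
kernels = rng.normal(size=(4, 3))     # 4 assumed kernels of width 3
conv_bias = np.zeros(4)
# 16 - 3 + 1 = 14 positions, pooled to 7, times 4 kernels = 28 features.
dense_w = rng.normal(size=(28, 2))
dense_b = np.zeros(2)
probs = forward(x, kernels, conv_bias, dense_w, dense_b)
print(probs)
```

The weights here are random, so the output probabilities carry no meaning; the sketch only shows how the shapes change from layer to layer.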
It can be seen from the data in the table that, with the training set and test set held constant,
the larger the time interval, the lower the accuracy. This shows that the recognition accuracy of
website fingerprints is strongly affected by time.
The experimental results show that the one-dimensional convolutional neural network achieves the
highest accuracy, and that Random Forest is slightly more accurate than Naive Bayes. Analysis of
the algorithms shows that the Naive Bayes method uses only the two-tuple of packet length and data
direction as features. The Random Forest method adds the total number of packets, the total size
of the site, and the required time to the Naive Bayes features, and also classifies data packets
by size interval, which improves accuracy to a certain extent. However, both the Random Forest and
Naive Bayes methods rely on fixed, pre-selected features, which limits them. The automatic feature
extraction used by the CNN essentially solves the problem of feature selection, so its accuracy is
higher.
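To make the contrast with fixed features concrete, the hand-selected features described above can be sketched as follows. The function name `handcrafted_features`, the packet representation as (timestamp, signed size) pairs, and the 500-byte size interval are assumptions for illustration; the baseline implementations compared in this paper may differ.

```python
from collections import Counter

def handcrafted_features(packets, bin_width=500):
    """Compute fixed features from one trace (hypothetical sketch).

    packets: list of (timestamp_seconds, signed_size); the sign of the size
    encodes the direction (positive = outgoing, negative = incoming).
    """
    # Naive Bayes style: counts of (packet length, direction) two-tuples.
    length_direction = Counter(
        (abs(size), 1 if size > 0 else -1) for _, size in packets
    )
    # Random Forest style additions: total packets, total size, required time.
    timestamps = [t for t, _ in packets]
    summary = {
        "total_packets": len(packets),
        "total_bytes": sum(abs(size) for _, size in packets),
        "duration": max(timestamps) - min(timestamps) if packets else 0.0,
    }
    # Packets grouped by size interval (assumed interval width).
    size_bins = Counter(abs(size) // bin_width for _, size in packets)
    return length_direction, summary, size_bins

# Toy trace with assumed values.
trace = [(0.00, 512), (0.05, -1500), (0.09, -1500), (0.20, 512)]
length_direction, summary, size_bins = handcrafted_features(trace)
print(summary)
```

Every feature in this sketch has to be chosen in advance, whereas the CNN learns its features directly from the byte sequence.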
5. Conclusion
To address the difficulty of monitoring anonymous communication systems abused by criminals,
this paper proposes a TOR anonymous traffic feature extraction and identification method based on
a one-dimensional convolutional neural network. The method integrates feature extraction and
classification, and uses the CNN model to recognize fingerprints of TOR anonymous traffic. It
achieves an accuracy of 87.5% on the experimental data set, can effectively identify TOR anonymous
traffic, and supports the supervision and review of anonymous communication networks. This paper
also demonstrates the advancement and versatility of the one-dimensional convolutional neural
network method by comparing it with Naive Bayes, Random Forest and other methods.
The experimental results show that the influence of the size of the data set on recognition
accuracy cannot be ignored, so a large amount of data is needed to train the model. This is a
challenge for research, because anonymous network experiment data is not easy to obtain. How to
capture more anonymous network traffic and use unsupervised deep learning algorithms for feature
extraction and recognition is a direction worthy of further research. This paper is only a
preliminary exploration; future work will further optimize the model construction and network
parameters.
Acknowledgments
This work is supported by the Sichuan Science and Technology Program (No. 2019YFG0425,
No. 2019YFG0186, No. 2019YFG0408) and the Open Project of the Network and Data Security Key
Laboratory of Sichuan Province (No. NDS2018-1).
References
[1] Wang T and Goldberg I 2013 Improved website fingerprinting on TOR Proceedings of the 12th
    ACM Workshop on Privacy in the Electronic Society (Chicago: ACM Press) pp 201-210
[2] Guang Z 2011 Security analysis and research of Tor anonymous communication system (Academic
    degree dissertation, Shanghai Jiaotong University)
[3] Carhartt 2005 Cryptography and Network Security (Beijing: Tsinghua University Press)
[4] Xin L 2011 Research on Anonymous Communication Based on Tor Network (Academic degree
    dissertation, East China Normal University)
[5] Zhang S, Deng Z, Cheng D et al 2016 Efficient kNN Classification Algorithm for Big Data
    J. Neurocomputing pp 143-148