Cloud Computing and Security
First International Conference, ICCCS 2015
Nanjing, China, August 13–15, 2015
Revised Selected Papers
Lecture Notes in Computer Science 9483
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zürich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7409
Zhiqiu Huang · Xingming Sun · Junzhou Luo · Jian Wang (Eds.)
Cloud Computing and Security
First International Conference, ICCCS 2015
Nanjing, China, August 13–15, 2015
Revised Selected Papers
Editors

Zhiqiu Huang
Nanjing University of Aeronautics and Astronautics
Nanjing, China

Xingming Sun
Nanjing University of Information Science and Technology
Nanjing, China

Junzhou Luo
Nanjing University
Nanjing, Jiangsu, China

Jian Wang
Nanjing University of Aeronautics and Astronautics
Nanjing, Jiangsu, China
LNCS Sublibrary: SL3 – Information Systems and Applications, incl. Internet/Web, and HCI
Preface
This volume contains the papers presented at ICCCS 2015: the First International
Conference on Cloud Computing and Security held during August 13–15, 2015, in
Nanjing, China. The conference was hosted by the College of Computer Science and Technology at the Nanjing University of Aeronautics and Astronautics, which provided excellent facilities and material support. We made use of the excellent EasyChair submission and reviewing software.
The aim of the conference was to provide an international forum for the latest results of research, development, and applications in the field of cloud computing and information security. This year we received 158 submissions from eight countries, including the USA, China, Japan, the UK, India, Slovenia, Bangladesh, and Vietnam. Each submission was allocated to three Program Committee (PC) members and each paper received on average 2.2 reviews. The committee decided to accept 47 papers.
The program also included four excellent and informative invited talks: “Domain
Identity: Tools for Privacy-by-Design in Cloud Systems” by Prof. Miroslaw
Kutylowski, Wroclaw University of Technology, Poland; “Cloud Security: Challenges
and Practice” by Prof. Hai Jin, Huazhong University of Science and Technology,
China; “Cloud-Centric Assured Information Sharing for Secure Social Networking” by
Prof. Bhavani Thuraisingham, University of Texas at Dallas, USA; “New Trends of
Cloud Computing Security” by Prof. Zhenfu Cao, Shanghai Jiao Tong University,
China.
We would like to extend our sincere thanks to all authors who submitted papers to
ICCCS 2015, and all PC members. It was a truly great experience to work with such
talented and hard-working researchers. We also appreciate the external reviewers for
assisting the PC members in their particular areas of expertise. Finally, we would like
to thank all attendees for their active participation and the organizing team, who nicely
managed this conference. We look forward to seeing you again at next year’s ICCCS.
Organization
General Chairs
Elisa Bertino Purdue University, USA
Zhiqiu Huang Nanjing University of Aeronautics and Astronautics, China
Xingming Sun Nanjing University of Information Science and
Technology, China
Program Committee
Co-chairs
Junzhou Luo Southeast University, China
Kui Ren State University of New York at Buffalo, USA
Members
Arif Saeed University of Algeria, Algeria
Zhifeng Bao University of Tasmania, Australia
Elisa Bertino Purdue University, USA
Jintai Ding University of Cincinnati, USA
Zhangjie Fu Nanjing University of Information Science and
Technology, China
Lein Harn University of Missouri – Kansas City, USA
Jiwu Huang Sun Yat-sen University, China
Yongfeng Huang Tsinghua University, China
Hai Jin Huazhong University of Science and Technology, China
Jiwu Jing Institute of Information Engineering CAS, China
Jiguo Li Hohai University, China
Li Li Hangzhou Dianzi University, China
Qiaoliang Li Hunan Normal University, China
Xiangyang Li Illinois Institute of Technology, USA
Quansheng Liu University of South Brittany, France
Mingxin Lu Nanjing University, China
Yonglong Luo Anhui Normal University, China
Yi Mu University of Wollongong, Australia
Shaozhang Niu Beijing University of Posts and Telecommunications,
China
Xiamu Niu Harbin Institute of Technology, China
Jiaohua Qin Central South University of Forestry and Technology,
China
Yanzhen Qu Colorado Technical University, USA
Organizing Committee
Chair
Jian Wang Nanjing University of Aeronautics and Astronautics, China
Members
Dechang Pi Nanjing University of Aeronautics and Astronautics, China
Yu Zhou Nanjing University of Aeronautics and Astronautics, China
Youwen Zhu Nanjing University of Aeronautics and Astronautics, China
Juan Xu Nanjing University of Aeronautics and Astronautics, China
Liming Fang Nanjing University of Aeronautics and Astronautics, China
Xiangping Zhai Nanjing University of Aeronautics and Astronautics, China
Mingfu Xue Nanjing University of Aeronautics and Astronautics, China
Contents
Data Security
System Security
Cloud Platform
VH: A Lightweight Block Cipher

Xuejun Dai1(✉), Yuhua Huang1, Lu Chen1, Tingting Lu2, and Fei Su3

1 College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China
{daixuemai,hyuhua2k}@163.com, [email protected]
2 College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
[email protected]
3 Suzhou Chinsdom Co. Ltd., Suzhou 215500, China
[email protected]
1 Introduction
In recent years, a number of secure and high-performance block ciphers have been designed, promoting the development of cryptography [1]; examples include AES, Camellia, and SHACAL [2–4]. However, with the development of wireless network technology, ordinary block ciphers can hardly meet the constraints of resource-limited mobile terminals, and lightweight cryptographic algorithms are needed to satisfy constraints on hardware and software resources, computing power, energy consumption, and other limited resources.
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics,
Nanjing, China. The subject has been supported by the Science & Technology Support Plan of
Jiangsu Province under Grant No. BE2013879 and the NUAA Research Funding under Grant
No. NS2010097.
In this paper, we propose VH (Vertical Horizontal), a new lightweight 64-bit block cipher. The name refers to the arrangement of the 64-bit plaintext as an 8-byte * 8-bit matrix on which pseudo-random transformations are applied to each row and column; key lengths of 64, 80, 96, 112, and 128 bits are supported. One novel design feature of VH is the use of an SP structure in which a pseudo-random transformation is used to construct a 256-byte encryption transformation table and a 256-byte decryption transformation table, simplifying the S-box design and keeping the space occupied by the lightweight implementation small. In addition, after the P permutation VH applies a pseudo-random transformation along the diagonals, which further improves its security. VH achieves adequate immunity against currently known attacks and is efficient in both hardware and software. The hardware implementation of VH-80 requires 3182 GE [9], and its software efficiency is measured at 44.47 Mb/s; by comparison, the hardware implementation of CLEFIA requires 5979 GE and its software efficiency is 33.90 Mb/s. We therefore believe that the software and hardware efficiency of VH is higher than that of the internationally standardized CLEFIA algorithm on the same platform. VH achieves an excellent balance between security and performance: on the basis of meeting the security requirements, it improves software efficiency while also taking hardware efficiency into account, and is thus more practical.
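As a rough illustration of the table-based S-box idea, the sketch below builds a 256-entry encryption table from a stand-in pseudo-random permutation and derives the matching decryption table; the seed and the shuffle are illustrative assumptions, not the actual VH transformation.

```python
import random

def build_tables(seed=2015):
    """Build a 256-byte substitution table and its inverse.

    The pseudo-random permutation below is only a stand-in for the
    (unspecified) transformation used to build VH's tables.
    """
    rng = random.Random(seed)          # placeholder seed, not from the paper
    enc_table = list(range(256))
    rng.shuffle(enc_table)             # a random byte permutation as the encryption table
    dec_table = [0] * 256
    for x, y in enumerate(enc_table):
        dec_table[y] = x               # invert the permutation for decryption
    return enc_table, dec_table

enc_s, dec_s = build_tables()
assert all(dec_s[enc_s[b]] == b for b in range(256))
```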
2 Specification of VH
2.1 Notations
The following notations are used throughout this paper.
– P: 64-bit plaintext
– Key: Master key
(Figure: the overall structure of VH — the plaintext is XORed with round key K0, then r rounds of sBoxLayer and pLayer, each followed by a round-key addition, produce the ciphertext.)
(3) Data Encryption Process
• Initial Encryption
The initial ciphertext is C_0 = P ⊕ K_0, where P is the 64-bit plaintext and K_0 is the first 8 bytes of the key K.
• r-round Encryption
For i within [1, r], each round of the iteration has the following three steps:
First, perform a pseudo-random transformation on each byte of the data using the encryption S-box: M_i(j) = S[C_{i-1}(j)], where i ranges from 1 to r and X_i(j) denotes the j-th byte of X_i, 0 ≤ j ≤ 7.
Second, arrange the 64-bit data M_i into an 8 * 8 matrix and perform a pseudo-random transformation on each diagonal of M_i using the encryption S-box; denote the result by P_i.
Finally, obtain the ciphertext of the current iteration by XORing the output P_i with the sub-key K_i of the current iteration: C_i = P_i ⊕ K_i, where i ranges from 1 to r. The output of the last round is the final ciphertext C.
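To make the round structure concrete, here is a schematic Python sketch of the encryption path; the S-box table, the diagonal transformation, and the round keys are placeholder inputs (the concrete tables and the key schedule are not given in this excerpt), so this is an illustration of the structure rather than the actual VH specification.

```python
def vh_encrypt_sketch(plaintext, round_keys, sbox, diag_transform):
    """Schematic rendering of the round structure: C0 = P XOR K0, then r rounds
    of per-byte substitution, a diagonal transformation, and round-key addition."""
    assert len(plaintext) == 8 and all(len(k) == 8 for k in round_keys)
    state = bytes(p ^ k for p, k in zip(plaintext, round_keys[0]))   # initial XOR with K0
    for key in round_keys[1:]:                                       # rounds 1..r
        state = bytes(sbox[b] for b in state)                        # M_i(j) = S[C_{i-1}(j)]
        state = diag_transform(state)                                # diagonal transformation -> P_i
        state = bytes(s ^ kb for s, kb in zip(state, key))           # C_i = P_i XOR K_i
    return state
```

Decryption mirrors these steps with the inverse table and inverse diagonal transformation, as described next.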
(4) Data Decryption Process
• Initial Decryption
P_r = C ⊕ K_r, where C is the 64-bit ciphertext and K_r is the last 8 bytes of the expanded key.
• r-round Decryption
For i within [1, r], each round of the iteration has the following three steps:
First, perform a pseudo-random transformation on each byte of the data using the decryption S-box: M_i(j) = S^{-1}[P_{i+1}(j)], where i ranges from 1 to r and X_i(j) denotes the j-th byte of X_i, 0 ≤ j ≤ 7.
Second, arrange the 64-bit data M_i into an 8 * 8 matrix and perform a pseudo-random transformation on each diagonal of M_i using the decryption S-box; denote the result by C_i.
Finally, obtain the plaintext of the current iteration by XORing the output C_i with the sub-key K_i of the current round: P_i = C_i ⊕ K_i, where i ranges from 1 to r. The output of the last round is the final plaintext P_0.
3 Performance Evaluation
implement CLEFIA-128, neglecting the effect of different ASIC libraries. The authors of that reference stated that 6 GE are needed to store 1 bit, and 2.67 GE are required for an XOR operation [9].
For the key expansion of VH-80, 8 XOR gates, or 21.36 GE, are required to perform an XOR operation; a 64-bit sub-key needs to be stored during each round, totaling 384 GE. It requires 384 GE to store the 64-bit plaintext during encryption; the S-box demands 200 GE; 384 GE are needed to store the 64-bit ciphertext generated during each round; the encryption process requires 704 AND gates, totaling 936.32 GE, and 616 OR gates, totaling 819.28 GE; the XOR operation on the ciphertext and the sub-key during each round demands 12 XOR gates, totaling 53.4 GE. The number of gate circuits needed by VH and other lightweight block cipher algorithms is given in Table 2. Compared with CLEFIA, which needs 5979 GE for its hardware implementation, VH demands only 3182 GE. It demonstrates that hardware
4 Security Evaluation
cryptanalysis of the structure of the block cipher [20–22]. Their method can find different impossible differential paths. Impossible differential cryptanalysis of VH was carried out using this method; the maximum number of rounds covered is 6, and 8 impossible differential paths were found:
(0, 0, 0, a, 0, a, a, a) —6r↛ (0, 0, 0, 0, a, a, a, a), p = 1
(0, 0, a, 0, a, a, a, 0) —6r↛ (0, 0, 0, 0, a, a, a, a), p = 1
(0, a, 0, a, a, a, 0, 0) —6r↛ (0, 0, 0, a, a, a, a, 0), p = 1
(0, a, a, a, 0, 0, 0, a) —6r↛ (0, a, a, a, a, 0, 0, 0), p = 1
(a, 0, 0, 0, a, 0, a, a) —6r↛ (a, a, 0, 0, 0, 0, a, a), p = 1
(a, 0, a, a, a, 0, 0, 0) —6r↛ (0, 0, a, a, 0, 0, 0, 0), p = 1
(a, a, 0, 0, 0, a, 0, a) —6r↛ (a, a, a, 0, 0, 0, 0, a), p = 1
(a, a, a, 0, 0, 0, a, 0) —6r↛ (a, a, a, a, 0, 0, 0, 0), p = 1
5 Conclusion
References
1. Wu, W., Feng, D., Zhang, W.: Design and Analysis of Block Cipher (in Chinese). TsingHua
University Press, Beijing (2009)
2. Daemen, J., Rijmen, V.: The Design of Rijndael: AES-the Advanced Encryption Standard.
Springer, Heidelberg (2002)
3. Feng, D., Zhang, M., Zhang, Y.: Study on cloud computing security (in Chinese). J. Journal
of Software. 22, 71–83 (2011)
4. Lu, F., Wu, H.: The research of trust evaluation based on cloud model. J. Eng. Sci. 10, 84–90
(2008)
5. Shirai, T., Shibutani, K., Akishita, T., Moriai, S., Iwata, T.: The 128-Bit Blockcipher CLEFIA
(Extended Abstract). In: Biryukov, A. (ed.) FSE 2007. LNCS, vol. 4593, pp. 181–195.
Springer, Heidelberg (2007)
6. Tsunoo, Y., Tsujihara, E., Shigeri, M., Saito, T., Suzaki, T., Kubo, H.: Impossible
differential cryptanalysis of CLEFIA. In: Nyberg, K. (ed.) FSE 2008. LNCS, vol. 5086,
pp. 398–411. Springer, Heidelberg (2008)
7. Juels, A., Weis, S.A.: Authenticating pervasive devices with human protocols. In: Shoup, V.
(ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 293–308. Springer, Heidelberg (2005)
8. Bogdanov, A.A., Knudsen, L.R., Leander, G., Paar, C., Poschmann, A., Robshaw, M.,
Seurin, Y., Vikkelsoe, C.: PRESENT: an ultra-lightweight block cipher. In: Paillier, P.,
Verbauwhede, I. (eds.) CHES 2007. LNCS, vol. 4727, pp. 450–466. Springer, Heidelberg
(2007)
9. Özen, O., Varıcı, K., Tezcan, C., Kocair, Ç.: Lightweight block ciphers revisited:
cryptanalysis of reduced round PRESENT and HIGHT. In: Boyd, C., González Nieto,
J. (eds.) ACISP 2009. LNCS, vol. 5594, pp. 90–107. Springer, Heidelberg (2009)
10. Izadi, M., Sadeghiyan, B., Sadeghian, S.: MIBS: a new lightweight block cipher. In: Garay,
J.A., Miyaji, A., Otsuka, A. (eds.) CANS 2009. LNCS, vol. 5888, pp. 334–348. Springer,
Heidelberg (2009)
11. Bay, A., Nakahara Jr., J., Vaudenay, S.: Cryptanalysis of reduced-round MIBS block cipher.
In: Heng, S.-H., Wright, R.N., Goi, B.-M. (eds.) CANS 2010. LNCS, vol. 6467, pp. 1–19.
Springer, Heidelberg (2010)
12. Biham, E., Shamir, A.: Differential Cryptanalysis of The Data Encryption Standard.
Springer, New York (1993)
13. Su, B., Wu, W., Zhang, W.: Differential cryptanalysis of SMS4 block cipher. In: IACR,
Cryptology Eprint Archive (2010)
14. Matsui, M.: Linear cryptanalysis method for DES cipher. In: Helleseth, T. (ed.)
EUROCRYPT 1993. LNCS, vol. 765, pp. 386–397. Springer, Heidelberg (1994)
15. Kanda, M., Takashima, Y., Matsumoto, T., Aoki, K., Ohta, K.: A strategy for constructing
fast round functions with practical security against differential and linear cryptanalysis. In:
Tavares, S., Meijer, H. (eds.) SAC 1998. LNCS, vol. 1556, pp. 264–279. Springer,
Heidelberg (1999)
16. Kanda, M.: Practical security evaluation against differential and linear cryptanalysis for
Feistel ciphers with SPN round function. In: Stinson, D.R., Tavares, S. (eds.) SAC 2000.
LNCS, vol. 2012, pp. 324–338. Springer, Heidelberg (2001)
17. Hong, S.H., Lee, S.-J., Lim, J.-I., Sung, J., Cheon, D.H., Cho, I.: Provable security against
differential and linear cryptanalysis for the SPN structure. In: Schneier, B. (ed.) FSE 2000.
LNCS, vol. 1978, pp. 273–283. Springer, Heidelberg (2001)
18. Liu, F., Ji, W., Hu, L., Ding, J., Lv, S., Pyshkin, A., Weinmann, R.-P.: Analysis of the SMS4
block cipher. In: Pieprzyk, J., Ghodosi, H., Dawson, E. (eds.) ACISP 2007. LNCS, vol. 4586,
pp. 158–170. Springer, Heidelberg (2007)
19. Ojha, S.K., Kumar, N., Jain, K., Sangeeta: TWIS – a lightweight block cipher. In: Prakash,
A., Sen Gupta, I. (eds.) ICISS 2009. LNCS, vol. 5905, pp. 280–291. Springer, Heidelberg
(2009)
20. Biham, E., Biryukov, A., Shamir, A.: Cryptanalysis of Skipjack reduced to 31 rounds using
impossible differentials. In: Stern, J. (ed.) EUROCRYPT 1999. LNCS, vol. 1592, pp. 12–23.
Springer, Heidelberg (1999)
21. Zhang, W., Wu, W., Zhang, L., Feng, D.: Improved related-key impossible differential
attacks on reduced-round AES-192. In: Biham, E., Youssef, A.M. (eds.) SAC 2006. LNCS,
vol. 4356, pp. 15–27. Springer, Heidelberg (2007)
22. Kim, J.-S., Hong, S.H., Sung, J., Lee, S.-J., Lim, J.-I., Sung, S.H.: Impossible differential
cryptanalysis for block cipher structures. In: Johansson, T., Maitra, S. (eds.) INDOCRYPT
2003. LNCS, vol. 2904, pp. 82–96. Springer, Heidelberg (2003)
Security Against Hardware Trojan Attacks
Through a Novel Chaos FSM and Delay
Chains Array PUF Based Design
Obfuscation Scheme
1 Introduction
Hardware Trojan attacks have emerged as a major threat for integrated circuits
(ICs) [1–3]. A design can be tampered in untrusted design houses or fabrication
facilities by inserting hardware Trojans. Hardware Trojans can cause malfunction,
undesired functional behaviors, leaking confidential information or other catastrophic
consequences in critical systems. Methods against hardware Trojan attacks are badly
needed to ensure trust in ICs and system-on-chips (SoCs).
However, hardware Trojans are stealthy by nature: they are triggered under rare or specific conditions that can evade post-manufacturing tests. Moreover, there are a large number of possible Trojan instances the adversary can exploit, and Trojans can implement many diverse functions, which makes hardware Trojan detection by logic testing extremely challenging.
Many side-channel signal analysis approaches [4–6] have been proposed to detect hardware Trojans by extracting parameters of the circuits, e.g., leakage current, transient current, power, or delay. However, these methods are very susceptible to process variations and noise. Moreover, the detection sensitivity is greatly reduced for small Trojans in modern ICs with millions of gates. A few regional activation approaches were proposed to magnify the Trojan's contributions [7, 8]. However, large numbers of random patterns were applied during detection, and computationally expensive training processes were also used for pattern selection. Besides, these approaches need to alter the original design and add more circuits during the design phase, which increases the complexity of the design process and produces a considerable overhead for large designs.
Most of the existing works require golden chips to provide reference signals for
hardware Trojan detection. However, obtaining a golden chip is extremely difficult [9].
The golden chips are supposed to be either fabricated by a trusted foundry or verified to
be Trojan-free through strict reverse engineering. Both methods are prohibitively
expensive. In some scenarios, the golden chips do not even exist, e.g., if the mask is altered at the foundry. Recently, a few methods have been proposed to detect hardware Trojans without golden chips by using self-authentication techniques [9–11]. However, these methods are not without limitations [12]. They need expensive computations, sophisticated process variation models, and a large number of measurements to ensure accuracy for large designs.
R.S. Chakraborty et al. proposed an application of design obfuscation against hardware Trojan attacks [13, 14]. However, the key sequence is the same for all chips from the same design, which makes the approach itself vulnerable to various kinds of attacks: e.g., the fab can use the key to unlock all the chips and then tamper with or overbuild the chips arbitrarily, and the fab or a user can release the key sequence of the design in the public domain, which makes the chip vulnerable to reverse-engineering attacks. Besides, the procedure for determining unreachable states is time consuming and computationally complex.
This paper presents a novel design obfuscation scheme against hardware Trojan
attacks based on chaos finite state machine (FSM) and delay chains array physical
unclonable function (PUF). The obfuscation scheme is realized by hiding and locking the
original FSM using exponentially many chaotic states. We exploit the pseudo-random characteristics of the M-sequences [15] to propose a chaos FSM design method which can
generate many random states and transitions with low overhead. The chip is obfuscated
and would not be functional without a unique key that can only be computed by the
designer. We also propose a new PUF construction method, named delay chains array
PUF (DAPUF), to extract the unique signature of each chip. Each chip has a unique
key sequence corresponding to its power-up state. We introduce confusion between delay chains to achieve avalanche effects in the PUF outputs. Thus the proposed DAPUF can provide a large number of PUF instances with high accuracy and reverse-engineering resistance. Through the proposed scheme, the designer can control the IC's operation
modes (chaos mode and normal mode) and functionalities, and can also remotely disable
the chips when hardware Trojan is revealed. The functional obfuscation prevents
the adversary from understanding the real functionalities of the circuit as well as the real
rare events in the internal nodes, thus making it difficult for the adversary to insert
hard-to-detect Trojans. It also tends to render inserted Trojans invalid, since the Trojans are most likely designed with respect to the chaos mode and will therefore be activated only in the chaos mode. Both simulation experiments on benchmark circuits and hardware evaluations on FPGA platforms show the security, low overhead, and practicality of the proposed method.
2 Overall Flow
First, the designer exploits the high level design description to form the FSM. Then, the
original FSM is modified based on the nonlinear combination of the M-sequences to
generate the chaos FSM which adds exponentially many random states and transitions
to obfuscate the functional states. Then, the DAPUF module is constructed to generate
the unique signature of each IC. After the subsequent design procedures are performed,
the foundry will receive necessary information to fabricate the chips. Each chip is
locked upon fabrication. In the proposed chaos FSM construction scheme, up to 2^20 chaotic states are added, so the functional states of the chip are submerged in a large number of chaotic states. When the chip is powered on, it will with high probability fall into a chaotic state and thus be non-functional; this is named the chaos mode.
Then, the foundry applies the set of challenge inputs to the PUF unit on each chip
and sends the PUF response to the designer. Each PUF response is corresponding to a
unique power-up state. The designer who masters the M-sequences and transition table
is the only entity who can compute the passkey to make the chip enter into functional
states. After applying the passkey, the locked IC will traverse a set of transitions and reach the functional reset state. The IC is then unlocked and becomes functional; this is called the normal mode.
Figure 1 shows the proposed design obfuscation scheme. In this example, there are 8 states in the original FSM, 24 chaotic states are generated and added to the original FSM, and 4 black hole states are introduced to enable the designer to disable the chips when needed. The DAPUF module generates a unique identifier for each chip corresponding to a unique power-up state. The power-up state is most likely to fall into the chaos FSM, making the chip nonfunctional until an IC-unique passkey is applied. The passkey makes the chip traverse through a set of transitions and reach the functional reset state.
Fig. 1. The chaos FSM and DAPUF based design obfuscation scheme against hardware Trojans
(Fig. 2: an example state transition graph with reset state O0, chaotic states S1–S7, circular transitions Cij such as C01, C23, C45, C56, C67, C70, added transitions C15, C24, C36, C72, and disconnected paths D12(C12) and D34(C34).)
connected; (3) there is at least one chaotic state that can reach the reset state O0 through one transition C_{iO0}.
The first step of the chaos process is to add new random transition paths to the existing circular STG. In Fig. 2, that means adding new C_{ij} with j ≠ i+1, such as C15, C24, C36, and C72. The second step is to disconnect several transition paths randomly, that is, to replace C_{ij} with D_{ij}, such as D12 and D34. Each time a D_{ij} is added, we should check the connectivity between state i and state j, which means there must be at least one transition path C_{i i1}, C_{i1 i2}, ..., C_{im j} connecting state i and state j; otherwise, one needs to add a corresponding C_{in im} to ensure the connectivity.
Suppose the power-up state is S_n; there is one transition path C_{Sn i1}, C_{i1 i2}, ..., C_{im O0} ensuring that the power-up state and the reset state O0 are connected. We can get O0 = F{F{...F{F{S_n, C_{Sn i1}, K_{Sn i1}}, C_{i1 i2}, K_{i1 i2}}, ..., C_{im O0}, K_{im O0}}. To unlock the chip, the transition keys K_{Sn i1}, K_{i1 i2}, ..., K_{im O0} are needed. The designer can increase the number of required transition keys to increase the complexity of the key space. Since there is a huge number of chaotic states, adding one transition path will exponentially increase the key space while the added overhead is negligible.
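To illustrate the unlocking idea, the toy sketch below uses a handful of named states in place of the 2^20-state chaos FSM; the transition table and keys are invented for illustration and are not derived from M-sequences as in the actual construction.

```python
# Toy chaos-FSM unlock: a transition fires only when the matching key is applied.
# States and keys are invented; the real scheme derives them from M-sequences.
transitions = {                 # (current state, transition key) -> next state
    ("S9", "k1"): "S17",
    ("S17", "k2"): "S30",
    ("S30", "k3"): "O0",        # "O0" is the functional reset state
}

def apply_passkey(power_up_state, passkey):
    state = power_up_state
    for key in passkey:
        # a wrong key leaves the chip wandering among chaotic states
        state = transitions.get((state, key), state)
    return state

assert apply_passkey("S9", ["k1", "k2", "k3"]) == "O0"   # correct key path unlocks the chip
assert apply_passkey("S9", ["k2", "k1", "k3"]) != "O0"   # wrong order stays in chaos mode
```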
In the chaos FSM construction, the designer can also create black hole states to disable the chip when needed [16]. Black hole states are states that cannot return to the functional states regardless of the input sequences. When the fab provides the power-up state to the designer asking for the unlock passkeys, the designer may deliberately provide a specific input sequence that makes the chip enter black hole states rather than the reset state, once hardware Trojan insertion is revealed.
Changes in the chip's operating conditions, including temperature and voltage, will affect the physical characteristics of the chip and thus affect the PUF. The traditional solution is to add redundancy or use error-correcting codes. Two advantages of the
∑_{n=1}^{kB} (t_B + t_{Bn} + r) + T = ∑_{n=1}^{kD} (t_D + t_{Dn} + r)    (1)
Here, k_B and k_D are the number of propagated buffers and the number of propagated D flip-flops when the end signal catches up with the start signal. We can obtain the tuple (k_B, k_D) through the time discriminator. This tuple (k_B, k_D) is unique for each chip and physically stochastic.
We define m groups of input signals as {X_1, Y_1, T_1}, ..., {X_m, Y_m, T_m}, where X_1, X_2, ..., X_m represent the high-level input signals of the D flip-flop, Y_1, Y_2, ..., Y_m represent the high-level input signals of the buffer, and T_1, T_2, ..., T_m represent the time differences between the input signals X_g and Y_g, 1 ≤ g ≤ m. Let us define:
where W{X_g, Y_g, T_g} represents the delay chain's response to the g-th group of input signals. k_gB and k_gD are the number of propagated buffers and the number of propagated D flip-flops when the signal Y_g catches up with the signal X_g. Thus, we can obtain a group of tuples {(k_1B, k_1D), (k_2B, k_2D), ..., (k_mB, k_mD)} based on the response of the DPUF. Then, we can extract the PUF characteristic vector:

E⃗ = [(k_2B, k_2D) − (k_1B, k_1D), (k_3B, k_3D) − (k_2B, k_2D), ..., (k_mB, k_mD) − (k_(m−1)B, k_(m−1)D)]    (3)
Here, E⃗ is the power-up state of the chip. Generally, one can consider the working conditions of different regions within one chip to be the same; using relative values for the characterization therefore eliminates the effect of the working conditions.
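A minimal sketch of the relative-value extraction of Eq. (3): given the tuples (k_gB, k_gD) reported by the time discriminator, consecutive differences form the characteristic vector. The example readings below are invented for illustration.

```python
def characteristic_vector(readings):
    """Eq. (3): component-wise differences of consecutive (kB, kD) tuples."""
    return [(kb2 - kb1, kd2 - kd1)
            for (kb1, kd1), (kb2, kd2) in zip(readings, readings[1:])]

# hypothetical time-discriminator readings for m = 4 challenge groups
readings = [(12, 9), (15, 11), (14, 10), (18, 13)]
print(characteristic_vector(readings))   # [(3, 2), (-1, -1), (4, 3)] -- relative values
```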
However, there are some shortcomings if we only use the one-dimensional delay chain PUF (DPUF). Since the precision is only 1 ns, the physically unique characteristic of each chip is quantized at this precision. Thus, the number of possible PUF instances is reduced by this limitation. Fewer PUF instances may cause a collision problem, meaning that two chips with different characteristics may have the same DPUF response after quantization. Therefore, this paper proposes the delay chains array PUF (DAPUF), which can magnify the delay differences and thus significantly increase the number of PUF instances.
The block diagram of the proposed DAPUF is shown in Fig. 4. I_1, I_2, ..., I_{2h} represent the rising edges of the input signals, where h is the number of delay chains. We add confusion between delay chains through the configuration of Con_ij. Con_ij indicates whether the node in row i, column j is connected; in other words, Con_ij = 1 if the node is connected and Con_ij = 0 if it is disconnected. Obviously, the arrival time of the rising edge at each node P_{i'j'} depends on the values of all Con_ij (i < i', j < j'). As a simple example, shown in Fig. 4, since Con_11, Con_22, and Con_31 are 1, the time of the rising edge at node P_23 is affected by I_1, I_2, and I_3. The time discriminator detects the first rising edges of each delay chain to obtain (k_B, k_D) and extracts the corresponding characteristic vector E⃗. The final DAPUF characteristic vector is {E⃗_1, E⃗_2, ..., E⃗_h}. Obviously, the configuration of the Con_ij significantly magnifies the delay differences between delay chains and greatly increases the number of PUF instances, with avalanche effects.
Fig. 4. The structure of the proposed delay chains array PUF (DAPUF)
and will traverse in the chaos FSM. The states of the chaos FSM are chaotic and unpredictable. When a wrong passkey is applied, the module starts to perform passkey identification; since the passkey is wrong, after the identification the chip continues to traverse in the chaos FSM, until the correct passkey (00010100100100001100111100000001) is applied. Then the module performs passkey identification again, the chaos FSM traverses to the reset state "000" through several transition paths, and the chip is unlocked. The chip then enters the normal mode.
Table 1 shows the overhead of the proposed scheme on ISCAS89 benchmark circuits as well as on the CC1200 wireless transmitter control circuit. The overhead has two main parts: the DAPUF construction and the chaos FSM implementation. The DAPUF module in our experiments consists of 28 delay chains, each of which contains 10 buffers and 8 D flip-flops. The results show that this DAPUF module costs 156 logic elements (LEs). We use a large-scale DAPUF in the experiment to ensure that massive numbers of PUF instances and strong avalanche effects are obtained. The overhead of the chaos FSM implementation mainly comes from the polynomial coefficients of the M-sequences, the added transition paths table, and the storage of the passkeys. In this paper, there are 2^20 chaotic states, and the unlocking process needs 4 passkeys, each 128 bits long. Note that, in related works, the PUF was not included in the evaluations; if we remove the overhead of the DAPUF, the overhead of the scheme is rather small. It is shown that as the circuit's size becomes larger, the overhead becomes much smaller. For modern circuits with millions of gates, the overhead of the scheme is negligible.
(Figure: simulation waveform showing the CLK, the control signal, the start signal of each D flip-flop, and the end signal of each buffer.)
6 Conclusion
We have developed a novel design obfuscation scheme against hardware Trojan attacks based on a chaos FSM and DAPUF. We propose a chaos FSM design method which can generate many random states to obfuscate the original FSM. We also propose a new PUF construction method that obtains a large number of PUF instances with avalanche effects. Through the proposed scheme, the designer can control the IC's operation modes and functionalities, and can remotely disable the chips. The obfuscation prevents the adversary from understanding the real function and the real rare events of the circuit, thus making it difficult to insert Trojans. It also tends to render inserted Trojans invalid, since they are most likely designed with respect to the chaos mode and will therefore be activated only in the chaos mode.
References
1. Bhunia, S., Hsiao, M.S., Banga, M., Narasimhan, S.: Hardware Trojan attacks: threat
analysis and countermeasures. Proc. IEEE 102(8), 1229–1247 (2014)
2. Rostami, M., Koushanfar, F., Karri, R.: A primer on hardware security: models, methods,
and metrics. Proc. IEEE 102(8), 1283–1295 (2014)
3. Bhunia, S., Abramovici, M., Agarwal, D., Bradley, P., Hsiao, M.S., Plusquellic, J.,
Tehranipoor, M.: Protection against hardware Trojan attacks: towards a comprehensive
solution. IEEE Des. Test 30(3), 6–17 (2013)
4. Agrawal, D., Baktır, S., Karakoyunlu, D., Rohatgi, P., Sunar, B.: Trojan detection using IC
fingerprinting. In: Proceedings of IEEE Symposium on Security and Privacy (SP 2007),
pp. 296–310. Berkeley, California, 20–23 May 2007
5. Nowroz, A.N., Hu, K., Koushanfar, F., Reda, S.: Novel techniques for high-sensitivity
hardware Trojan detection using thermal and power maps. IEEE Trans. Comput.-Aided Des.
Integr. Circuits Syst. 33(12), 1792–1805 (2014)
6. Potkonjak, M., Nahapetian, A., Nelson, M., Massey, T.: Hardware Trojan horse detection
using gate-level characterization. In: 46th Design Automation Conference (DAC 2009),
pp. 688–693. San Francisco, California, USA, 26–31 July 2009
7. Banga, M., Hsiao, M.S.: A region based approach for the identification of hardware Trojans.
In: IEEE International Workshop on Hardware-Oriented Security and Trust (HOST 2008),
pp. 40–47. Anaheim, CA, 9–9 June 2008
8. Wei, S., Potkonjak, M.: Scalable hardware Trojan diagnosis. IEEE Trans. Very Large Scale
Integr. VLSI Syst. 20(6), 1049–1057 (2012)
9. Davoodi, A., Li, M., Tehranipoor, M.: A sensor-assisted self-authentication framework for
hardware Trojan detection. IEEE Des. Test 30(5), 74–82 (2013)
10. Wei, S., Potkonjak, M.: Self-consistency and consistency-based detection and diagnosis of
malicious circuitry. IEEE Trans. Very Large Scale Integr. VLSI Syst. 22(9), 1845–1853
(2014)
11. Xiao, K., Forte, D., Tehranipoor, M.: A novel built-in self-authentication technique to
prevent inserting hardware Trojans. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst.
33(12), 1778–1791 (2014)
12. Cao, Y., Chang, C.-H., Chen, S.: A cluster-based distributed active current sensing circuit
for hardware Trojan detection. IEEE Trans. Inf. Forensics Secur. 9(12), 2220–2231 (2014)
13. Chakraborty, R.S., Bhunia, S.: Security against hardware Trojan through a novel application
of design obfuscation. In: IEEE/ACM International Conference on Computer-Aided Design
(ICCAD), pp. 113–116. San Jose, California, 2–5 November 2009
14. Chakraborty, R.S., Bhunia, S.: Security against hardware Trojan attacks using key-based
design obfuscation. J. Electron. Test. 27, 767–785 (2011). Springer
15. Klapper, A., Goresky, M.: 2-Adic shift registers. In: Anderson, R. (ed.) FSE 1993. LNCS,
vol. 809. Springer, Heidelberg (1994)
16. Alkabani, Y.M., Koushanfar, F.: Active hardware metering for intellectual property
protection and security. In: Proceedings of USENIX Security Symposium, pp. 291–306.
Berkeley, CA, USA (2007)
17. Zou, J., Yu, W., Chen, Q.: Resolution of time-interval measurement based on chain delay
difference of FPGA (in Chinese). Opt. Optoelectron. Technol. 12(5), 43–45 (2014)
Secure and Efficient Protocol for Outsourcing
Large-Scale Systems of Linear Equations
to the Cloud
1 Introduction
With the rapid development of cloud services in business and scientific applications, it has become an increasingly important trend to provide service-oriented computing services, which are more economical than traditional IT services. Computation outsourcing enables a client with relatively weak computing power to hand over a hard computational task to servers with more powerful computing capability and sufficient computing resources. The servers are then expected to fulfill the computation and return the result correctly. Such a computing model is especially suitable for cloud computing, in which a client can rent computation servers from a cloud service provider in a pay-as-you-go [1] manner. This computational framework enables customers to receive correct results without purchasing, provisioning, and maintaining their own computing resources.
3. Our protocol requires only one round of communication between the client and the cloud servers, and the client can detect misbehavior of the servers with probability 1. Furthermore, theoretical analysis and experimental evaluation show that our protocol can be deployed in practical applications immediately.
1.3 Organization
2 Our Framework
In this paper, the security threats primarily come from the cloud servers. In general, there are two threat models for outsourcing computations: the semi-honest model and the fully malicious model. Goldreich et al. [17] first introduced the semi-honest model. In this model, both parties are guaranteed to properly execute a prescribed protocol; however, the cloud records all the knowledge it can access and tries its best to retrieve sensitive information such as the secret input and output of the client. The fully malicious model is the strongest adversarial model. In many applications, we must consider the case that the servers are attackers; for example, servers can intentionally send a computationally indistinguishable result to mislead the client.
Hohenberger and Lysyanskaya [18] first introduced a model called the "two untrusted program model". In this model, there are two non-colluding servers, and we assume at most one of them is adversarial while we cannot know which one; moreover, the misbehavior of the dishonest server can be detected with an overwhelming probability. Recently, this framework of two non-colluding cloud servers has been employed in previous schemes (e.g., [13, 14]).
3 The Protocol
We have carried out an extensive analysis of the existing techniques and, based on this analysis, we propose a new protocol to securely outsource LSLE to the cloud. Our approach is non-interactive and requires less computational effort on the client side.
The input of the original LSLE problem is a coefficient matrix A ∈ R^{n×n} and a coefficient vector b ∈ R^n such that Ax = b. The client, with relatively weak computing power, wants to securely outsource the LSLE task to the public cloud. The protocol invokes Algorithms 1 and 2 to set up the secret key.
Then, we describe the LSLE encryption procedure with detailed techniques.
where A1(i, j) = α_i E1(π1(i), j), A2(i, j) = (β_i γ_j) P2(π2(i), j), and A4(i, j) = θ_i E4(π3(i), j).
4: Additively split the original matrix A as A = D + F, where F = A − D.
5: The client computes D^{-1}Ax = D^{-1}b; since A = D + F, we get (E + D^{-1}F)x = D^{-1}b.
6: The original LSLE problem can finally be rewritten as A'x = b', where A' = E + D^{-1}F and b' = D^{-1}b.
The computational complexity of A1^{-1} and A4^{-1} is O(n), and the calculation of −A1^{-1}A2A4^{-1} takes at most γn multiplications. That is, the computational complexity is O(n), and thus the client can efficiently compute D^{-1}. Obviously, D^{-1} is also a sparse matrix; since the constant γ ≪ n, the computational complexity of D^{-1}F is O(n^2). Hereafter, the proof is straightforward.
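The cheap computation of D^{-1} can be illustrated numerically: assuming D has a block upper-triangular layout with easily invertible (here, diagonal) blocks A1 and A4 and a sparse coupling block A2 — an assumption made only for this sketch — its inverse follows from A1^{-1}, A4^{-1}, and the product −A1^{-1}A2A4^{-1} mentioned above.

```python
import numpy as np

n = 4
A1 = np.diag([2.0, 3.0, 5.0, 7.0])                        # assumed easily invertible diagonal block
A4 = np.diag([1.0, 4.0, 6.0, 8.0])
A2 = np.zeros((n, n)); A2[0, 2] = 1.5; A2[3, 1] = -2.0    # sparse coupling block

D = np.block([[A1, A2], [np.zeros((n, n)), A4]])          # block upper-triangular D

A1_inv = np.diag(1.0 / np.diag(A1))                       # O(n) inversion of the diagonal blocks
A4_inv = np.diag(1.0 / np.diag(A4))
D_inv = np.block([[A1_inv, -A1_inv @ A2 @ A4_inv],
                  [np.zeros((n, n)), A4_inv]])            # closed-form block inverse

assert np.allclose(D @ D_inv, np.eye(2 * n))
```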
x = N(y1 + y2) = N(N^{-1}A'^{-1}b1 + N^{-1}A'^{-1}b2) = A'^{-1}(b1 + b2) = A'^{-1}b' = A^{-1}b    (7)
Theorem 1. Input Privacy Analysis: In the fully malicious model, the protocol preserves the privacy of A and b.
Proof: From the above protocol instantiation, we can see that throughout the whole process the two cloud servers only see the disguised data (H, c1) and (H, c2), respectively. We first prove the privacy of the input b; for hiding the privacy
Theorem 2. Output Privacy Analysis: In the fully malicious model, the protocol preserves the privacy of x.
Proof: Server1 cannot recover the value x (the solution of the original LSLE problem Ax = b) from y1 (the solution to Hy1 = c1), and Server2 cannot recover the value x from y2 (the solution to Hy2 = c2). Under the assumption that these two servers do not collude, the solution value x is well protected.
Theorem 3. Robust Cheating Resistance: In the fully malicious model, the protocol is a 1-checkable implementation of LSLE.
Proof: Given a solution y, the client can efficiently verify whether the equation Hy = c holds, because the computational complexity of computing Hy is O(n^2). Hence, no invalid result from a malicious cloud server can pass the client's verification; misbehavior is detected with probability 1.
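The end-to-end flow implied by Eq. (7) and Theorems 1–3 can be mimicked numerically as below. Several details are simplifying assumptions made only for this sketch: the preprocessing A → A' is skipped, the disguised matrix is taken as H = AN for a client-chosen random invertible N, and b is split additively into c1 + c2; the paper's actual masking is not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.normal(size=(n, n)) + n * np.eye(n)     # nonsingular coefficient matrix
b = rng.normal(size=n)

# Client-side disguise (assumed form for this sketch): H = A N with a random
# invertible N, and an additive split of b into c1 + c2 for the two servers.
N = rng.normal(size=(n, n)) + n * np.eye(n)
H = A @ N
c1 = rng.normal(size=n)
c2 = b - c1

# Each server solves its own disguised system (one round of communication).
y1 = np.linalg.solve(H, c1)
y2 = np.linalg.solve(H, c2)

# Client verification (Theorem 3): checking H y_i = c_i costs only O(n^2).
assert np.allclose(H @ y1, c1) and np.allclose(H @ y2, c2)

# Recovery in the spirit of Eq. (7): x = N (y1 + y2) solves the original A x = b.
x = N @ (y1 + y2)
assert np.allclose(A @ x, b)
```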
Table 2. Notations
Table 3. Experimental results for secure LSLE outsourcing (columns: No., Dimension, Storage, t_original, t_client, t_server, t_client/t_original)
For secure linear equation outsourcing, the experimental results are shown in Table 3. All times are measured in seconds. In our experiment, Gauss-Jordan elimination, the most commonly used schoolbook matrix computation algorithm, is employed by the cloud servers.
For the sake of completeness, we investigate the superiority of our algorithm by making an experimental comparison with [12]. To give a fair comparison, we implemented Chen's algorithm [12] on our workstation, and the experimental results are shown in Table 4. As expected, the results in Fig. 2 show that our protocol is much more efficient than Chen's algorithm [12].
Regarding our main concern, the client's computational gain, Table 3 indicates that the gain is considerable and increases with the dimension of the coefficient matrix. In practice, there may be even more performance gains. For example, the cloud has more computational resources, e.g., memory. If the scale of the LSLE problem becomes large, there will be a lot of input/output operations for matrix computations. For a client with little memory, a lot of additional cost is needed to move data in and out of memory; the cloud, however, can handle this problem more easily with a large memory.
6 Concluding Remarks
In this paper, we proposed an efficient outsource-secure protocol for LSLE, which is one of the most basic and expensive operations in many scientific and business applications. We have proved that our protocol fulfills the goals of correctness, security, checkability, and high efficiency. Our protocol is suitable for any nonsingular dense matrix in the fully malicious model. Furthermore, the use of an effective additive splitting technique and the two non-colluding cloud servers model makes our protocol more secure and efficient than Chen's algorithm [12].
References
1. Armbrust, M., Fox, A., Griffith, R.: A view of cloud computing. Commun. ACM
53(4), 50–58 (2010)
2. Benzi, M.: Preconditioning techniques for large linear systems: a survey. J. Comput.
Phy. 182(2), 418–477 (2002)
3. Brunette G., Mogull R.: Security guidance for critical areas of focus in cloud com-
puting v2. 1. Cloud Security Alliance, pp. 1–76 (2009)
4. Abadi M., Feigenbaum J., Kilian J.: On hiding information from an oracle. In:
Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing,
pp. 195–203. ACM (1987)
5. Gentry, C.: Fully homomorphic encryption using ideal lattices. STOC 9, 169–178
(2009)
6. Rivest, R.L., Adleman, L., Dertouzos, M.L.: On data banks and privacy homomor-
phisms. Found. Sec. Comput. 4(11), 169–180 (1978)
7. Benjamin, D., Atallah, M.J.: Private and cheating-free outsourcing of algebraic
computations. In: Sixth Annual Conference on Privacy, Security and Trust, PST
2008, pp. 240–245. IEEE (2008)
8. Blanton, M., Atallah, M.J., Frikken, K.B., Malluhi, Q.: Secure and efficient out-
sourcing of sequence comparisons. In: Foresti, S., Yung, M., Martinelli, F. (eds.)
ESORICS 2012. LNCS, vol. 7459, pp. 505–522. Springer, Heidelberg (2012)
9. Urs, K.M.R.: Harnessing the Cloud for Securely Outsourcing Large-scale Systems
of Linear Equations
10. Saad, Y.: Iterative methods for sparse linear systems. SIAM (2003)
11. Chen, F., Xiang, T., Yang, Y.: Privacy-preserving and verifiable protocols for sci-
entific computation outsourcing to the cloud. J. Parallel Distrib. Comput. 74(3),
2141–2151 (2014)
12. Xiaofeng, C., Xinyi, H., Jin, L., Jianfeng, M., Wenjing, L., Wong D.: New Algo-
rithms for Secure Outsourcing of Large-scale Systems of Linear Equations (2015)
13. Atallah, M.J., Frikken, K.B.: Securely outsourcing linear algebra computations. In:
Proceedings of the 5th ACM Symposium on Information, Computer and Commu-
nications Security, pp. 48–59. ACM (2010)
14. Elmehdwi, Y., Samanthula, B.K., Jiang, W.: Secure k-nearest neighbor query over
encrypted data in outsourced environments. 2014 IEEE 30th International Con-
ference on Data Engineering (ICDE), pp. 664–675. IEEE (2014)
15. Gibson, S.F.F., Mirtich, B.: A survey of deformable modeling in computer graphics.
Technical report, Mitsubishi Electric Research Laboratories (1997)
16. Oppenheim, A.V., Willsky, A.S., Nawab, S.H.: Signals and systems. Prentice-Hall,
Englewood Cliffs (1983)
17. Goldreich O., Micali S., Wigderson A.: How to play any mental game. In: Proceed-
ings of the Nineteenth Annual ACM Symposium on Theory of Computing, pp.
218–229. ACM (1987)
18. Hohenberger, S., Lysyanskaya, A.: How to securely outsource cryptographic com-
putations. In: Kilian, J. (ed.) TCC 2005. LNCS, vol. 3378, pp. 264–282. Springer,
Heidelberg (2005)
19. Goldreich, O.: Computational complexity: a conceptual perspective. ACM
SIGACT News 39(3), 35–39 (2008)
A Provably Secure Ciphertext-Policy
Hierarchical Attribute-Based Encryption
1 Introduction
Much attention has been paid to information security issues with the increasing development of network technology. Traditional PKC ensures data confidentiality, but limits the flexibility of access control [1]. Sahai and Waters [2] proposed the concept of attribute-based encryption (ABE) for more flexible and fine-grained access control. In ABE, sets of descriptive attributes are used to describe files and users. A user can decrypt a ciphertext when the associated access control policy is satisfied by his attributes. ABE can be divided into two categories: key-policy (KP-)ABE and ciphertext-policy (CP-)ABE. In KP-ABE, the system labels each ciphertext with a set of descriptive attributes, and access control policies are associated with the secret key of the user. CP-ABE is similar to KP-ABE, except that a secret key is associated with the user's attributes and the system labels each ciphertext with an access control policy.
1.1 Motivation
However, the hierarchy relationships among the attributes in the same category are
often overlooked. For example, according to the Academic Title Evaluation, university
is on behalf of access control policies. Balu and Kuppusamy [15] adopted Linear Integer Secret Sharing (LISS) to represent access control policies, implementing logic operations between access control policies and thereby making the access control structure more flexible.
The term HABE appears in several recent schemes in different senses. In 2011, the notion of hierarchical ABE (HABE) [3] was put forward, after which many HABE-based schemes [4–6] were proposed. Recently, a new HABE scheme [7] has been proposed which mainly aims to solve delegation problems.
2 Preliminaries
An algorithm B that outputs z ∈ {0, 1} has advantage ε in solving the decisional l-parallel BDHE problem in G.
(1) Rule 1
M_u can express any variable R_i in the access control policy P.
(2) Rule 2
For any OR-term P = P_a ∨ P_b, let M_a ∈ Z^{d_a×e_a} and M_b ∈ Z^{d_b×e_b} express the formulas P_a and P_b, respectively. P is expressed by a matrix M_OR ∈ Z^{(d_a+d_b)×(e_a+e_b−1)} constructed as

M_OR = [ c_a  r_a  0
         c_b  0    r_b ]

(3) Rule 3
For any AND-term P = P_a ∧ P_b, let M_a ∈ Z^{d_a×e_a} and M_b ∈ Z^{d_b×e_b} express the formulas P_a and P_b, respectively. A matrix M_AND ∈ Z^{(d_a+d_b)×(e_a+e_b)} expresses P and is constructed as

M_AND = [ c_a  c_a  r_a  0
          0    c_b  0    r_b ]
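As a concrete reading of the rules above, the sketch below assembles M_OR and M_AND from two sub-matrices, assuming (the excerpt does not define them) that c_a is the first column of M_a and r_a its remaining columns, and that a single variable is expressed by the 1×1 matrix (1); it then checks that the AND-policy shares recombine to the secret.

```python
import numpy as np

def m_or(Ma, Mb):
    """Rule 2, assuming c_x is the first column of M_x and r_x its remaining columns."""
    ca, ra = Ma[:, :1], Ma[:, 1:]
    cb, rb = Mb[:, :1], Mb[:, 1:]
    top = np.hstack([ca, ra, np.zeros((Ma.shape[0], rb.shape[1]))])
    bot = np.hstack([cb, np.zeros((Mb.shape[0], ra.shape[1])), rb])
    return np.vstack([top, bot]).astype(int)

def m_and(Ma, Mb):
    """Rule 3, under the same column-splitting assumption."""
    ca, ra = Ma[:, :1], Ma[:, 1:]
    cb, rb = Mb[:, :1], Mb[:, 1:]
    top = np.hstack([ca, ca, ra, np.zeros((Ma.shape[0], rb.shape[1]))])
    bot = np.hstack([np.zeros((Mb.shape[0], 1)), cb,
                     np.zeros((Mb.shape[0], ra.shape[1])), rb])
    return np.vstack([top, bot]).astype(int)

Mu = np.array([[1]])              # Rule 1: a single variable, assumed to be the 1x1 matrix (1)
M_and = m_and(Mu, Mu)             # policy R1 AND R2 -> [[1, 1], [0, 1]]
M_or = m_or(Mu, Mu)               # policy R1 OR R2  -> [[1], [1]]: each share alone equals s

s, rho2 = 42, 7                   # toy secret and random integer
shares = M_and @ np.array([s, rho2])    # shares: s1 = s + rho2, s2 = rho2
lam = np.array([1, -1])                 # reconstruction vector: M_and^T lam = (1, 0)^T
assert lam @ shares == s                # both rows together recover the secret
```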
KeyGen(A', params, mk): Input A', a set of given attributes, the master key mk, and the public parameters params; then select t ∈ Z_p at random and perform the following:
(1) Compute d_0 = g^t.
(2) For each attribute a ∈ A', 1 ≤ i ≤ d, compute d_1 = g^a g^{at} (h_i ∏_{d=1}^{e} u_d)^t and d_2 = h_i^t.
(3) The secret key is d_{A'} = (d_0, {d_1, d_2}_{a∈A'}).
Enc(m, P, params): Input the access control policy P, a message m ∈ G_1 to encrypt, and the public parameters params.
(1) Choose a random element s ∈ [−2^L, 2^L]. The above method constructs the distribution matrix M for the access control policy P. A is the set containing the attributes in the access control policy P. Choose ρ = (s, ρ_2, ..., ρ_e)^T, in which the ρ_i are uniformly randomly selected integers in [−2^{L_0+k}, 2^{L_0+k}]. Compute M·ρ = (s_1, ..., s_d)^T.
(2) Compute CT_1 = m·e(g, g)^{as} and CT_2 = g^s.
(3) For each attribute in P, compute CT_3 = (h_i ∏_h u_h)^s and CT_4 = g^{a s_i} h_i^{−s}, 1 ≤ i ≤ d.
Dec(CT, params, d_{A'}): Input a ciphertext CT and the secret key for a set A (A ⊆ A').
Assume that all attribute paths of the set A cover those in the access control policy
P, that is to say, A meets the access control policy P, then there exists a vector
P
kA 2 Z dA that satisfies MAT kA ¼ n. Reconstruct the secret using i2A ki si ¼ s, the
decryption algorithm calculates
Q n o
eðCT3 ; d0 Þ eðCT2 ; d2 Þ ½eðCT4 ; d0 Þki
i2A
CT1 Q
i2A eðCT2 ; d1 Þ
Q n Qe s t s t as s t ki o
i2A e hi h uh ; g e g ; hi e g i hi ; g
as
¼ meðg; gÞ Q Qe t
s a at h
i2A e g ; g g i d¼1 ud
Q Qe st s t as k s t
i2A e hi h uh ; g e g ; hi e g hi ; g
i i
as
¼ meðg; gÞ Q Qe st
s a
eðg ; g Þ eðg ; g Þ i2A e g; hi d¼1 ud
s at
Q s t as s t
i2A e g ; hi e g hi ; g
¼m
eðg; gÞast
Q s t s t
i2A e g ; hi eðg ; g Þ e hi ; g
as t
¼m
eðg; gÞast
Q st st
eðg; hi Þ eðhi ; gÞ eðg; gÞast
¼ m i2A ast
eðg; gÞ
¼m
4 Security Analysis
4.1 Security Model
Referring to [14], we first give our CP-HABE security model. We model semantic security against chosen-plaintext attack (CPA) in the selective-attribute (sAtt) model. We now present a game played between two players, a challenger C and an adversary A, which defines the notion of semantic security for the CP-HABE scheme.
Init: First, A chooses the challenge access control policy C*, which is given to C.
Setup: With a sufficiently large security parameter 1^k, C runs the algorithm Setup to obtain mk, the master key, as well as params, the public parameters. C gives A the params but keeps mk to herself.
Phase 1: A requests secret keys for a polynomially bounded number of attribute sets A_i not satisfying C*. C answers with KeyGen.
Challenge: A submits m_0 and m_1, two messages of the same length. C randomly chooses b ∈ {0, 1} and then encrypts m_b under C*. After that, the challenge ciphertext CT* is sent to A.
Phase 2: Similar to Phase 1.
Guess: A outputs b' ∈ {0, 1}, a guess of b. A wins the game if b' = b.
A's advantage is Adv_{CP-HABE,A} = |Pr[b' = b] − 1/2|.
Security Proof.
Theorem 1. No adversary is able to selectively break our construction in polynomial time if the decisional l-parallel BDHE assumption holds.
The above theorem ensures the security of our CP-HABE scheme.
Proof. Assume that in the security game we have an adversary A with a non-negligible advantage ε = Adv_{CP-HABE,A}. Below we show how to use the adversary A to construct a simulator B that solves the decisional l-parallel BDHE problem.
Init: B is given (G, G_1, e, g, g^s, y_1, y_2, ..., y_l, y_{l+2}, ..., y_{2l}, T), in which y_i = g^{a^i}; we let Y = (g, g^s, y_1, y_2, ..., y_l, y_{l+2}, ..., y_{2l}). A chooses a challenge access control policy (M*, ρ*), in which M* is a d*×e* matrix with e* ≤ l, and B receives it.
0
Setup: B chooses at random a0 2 Zp , let W ¼ eðy1 ; yl Þ eðg; gÞa ¼ e ga ; ga
l
0 0 lþ1
eðg; gÞa ¼ eðg; gÞa þ a . For each ai for 1 ai U , it firstly selects a value Zi 2 Zp
randomly, and U being the sets of attributes in the access control policy. If ai 62 P , set
Q M
hi ¼ gzi ej¼1 yj i;j , otherwise let hi ¼ gzi . For 1 k e , it chooses at random d0 ; d1 ;
A Provably Secure Ciphertext-Policy Hierarchical 45
. . .; de 2 Zp , let uk ¼ gdk y1 lk þ 1 . B sends public parameters params ¼ ðG; G1 ; g; e; W;
0 1þl
ðhi Þ1 i d ; uj 1 j e Þ to A, while keeping the master key mk ¼ ga þ a for itself.
Phase 1: In this phase A requests secret keys for a polynomially bounded number of attribute sets A* not satisfying the challenge access control policy (M*, ρ*). By the definition of LSSS, there must exist a vector K = (k_1, k_2, ..., k_{e*}) ∈ Z^{e*} such that M*·K = 0 with k_1 = 1. B selects t' ∈ Z_p at random, sets t = t' + Σ_{j=1}^{e*} k_j a^{l−j+1}, and then computes:
d_0 = g^t = g^{t' + Σ_{j=1}^{e*} k_j a^{l−j+1}} = ∏_{j=1}^{e*} y_{l−j+1}^{k_j} · g^{t'}

d_1 = g^a g^{at} (h_i ∏_{d=1}^{e*} u_d)^t = g^{a'} g^{a^{l+1}} g^{at} (h_i ∏_{d=1}^{e*} u_d)^t. Substituting t = t' + Σ_{j=1}^{e*} k_j a^{l−j+1} and the definitions of h_i and u_k given in Setup, all factors involving the unknown term g^{a^{l+1}} cancel, so d_1 can be computed from g^{t'} and the elements of Y.

d_2 = h_i^t = (g^{z_i} ∏_{j=1}^{e*} y_j^{M*_{i,j}})^t = ∏_{j=1}^{e*} y_{l−j+1}^{z_i k_j} · h_i^{t'}
B sends the secret key d = (d_0, {d_1, d_2}_{a∈A*}) to A.
Challenge: Two messages m_0, m_1 ∈ G_1 are submitted by A. B chooses b ∈ {0, 1} and encrypts m_b, computing for 1 ≤ i ≤ d*:

CT_1 = m_b · T · e(g^s, g^{a'}),  CT_2 = g^s,  CT_3 = (h_i ∏_h u_h)^s

Intuitively, B then chooses uniformly random integers q'_2, ..., q'_{e*} ∈ [−2^{L_0+k}, 2^{L_0+k}] and shares the secret s using the vector q* = (s, sa + q'_2, ..., sa^{e*−1} + q'_{e*}) with q'_1 = 0.
Let s_i = M*·q* = Σ_{j=1}^{e*} (sa^{j−1} M*_{i,j} + q'_j M*_{i,j}); then the corresponding challenge ciphertext component CT_4 = g^{a s_i} h_i^{−s} is computed. Expanding s_i and using h_i = g^{z_i} ∏_{j=1}^{e*} y_j^{M*_{i,j}}, the factors y_j^{s M*_{i,j}} cancel against y_j^{−s M*_{i,j}}, and B obtains

CT_4 = ∏_{j=1}^{e*} g^{a q'_j M*_{i,j}} · (g^{z_i})^{−s}

A receives the challenge ciphertext CT* = (CT_1, CT_2, {CT_3, CT_4}_{a∈A}).
Phase 2: Similar to Phase 1.
Guess: Finally, A outputs b'. B outputs 0, to claim that T = e(g^s, y_{l+1}) = e(g, g)^{s a^{l+1}}, if b' = b; otherwise it outputs 1, to indicate that it considers T a random element of the group G_1.
B provides a perfect simulation when T = e(g^s, y_{l+1}), since then CT_1 = m_b · T · e(g^s, g^{a'}) = m_b · e(g, g)^{s(a' + a^{l+1})} = m_b · e(g, g)^{as}; in this case the advantage is Pr[B(Y, e(g^s, y_{l+1})) = 0] = 1/2 + Adv_{CP-HABE,A}. Otherwise, if T is a random group element R in G_1, the advantage is Pr[B(Y, R) = 0] = 1/2.
All the analysis shows that

|Pr[B(Y, e(g^s, y_{l+1})) = 0] − Pr[B(Y, R) = 0]| ≥ (1/2 + ε) − 1/2 = ε
5 Performance Evaluation
In comparison with the scheme in [7], the following tables show the results of the comparison of storage cost and computation cost, respectively.
Table 1. Comparison of storage cost

Scheme       Public parameters    Secret key               Ciphertext
[7] scheme   (L + D + 3)l1 + l2   [(l − k + 1)|S| + 2]l1   (3l + 2)l1
Our scheme   (L + D + 2)l1 + l2   (2|S| + 1)l1             (2l + 2)l1
In these two tables, L and D indicate the number of rows and the number of columns of the attribute matrix in the scheme of [7], and the maximum depth and the number of attribute trees in our scheme, respectively; k is the depth of the user (k ≤ L); l1 and l2 are the bit lengths of the elements of the groups G and G1, respectively; and l and n are the numbers of rows and columns, respectively, of the share-generating matrix. Exponentiation, choosing a random element, and computing a bilinear pairing are denoted by se, sr, and sp. |S| and |S*| are the number of attribute vectors of the set satisfying the access structure related to a ciphertext and the number related to a user, respectively.
We give the comparison of the scheme in [7] with our scheme in terms of storage
cost and computation cost in Tables 1 and 2. Our CP-HABE scheme has significantly
better performance than the scheme in [7].
Additionally, the scheme in [7] is essentially a hierarchical identity-based encryption scheme, whereas our scheme is a truly hierarchical attribute-based encryption scheme: it can reflect the hierarchical relationships among the attributes in the same category and thereby simplify access control rules.
Moreover, the access control structure in our CP-HABE scheme can express the logical relationships between access control policies, which makes it more expressive.
6 Conclusion
Acknowledgements. This work is partly supported by the Fundamental Research Funds for the Central Universities (No. NZ2015108), the China Postdoctoral Science Foundation funded project (2015M571752), the Jiangsu Planned Projects for Postdoctoral Research Funds (1402033C), and the Open Project Foundation of the Information Technology Research Base of the Civil Aviation Administration of China (No. CAAC-ITRB-201405).
References
1. Guo, L., Wang, J., Wu, H., Du, H.: eXtensible Markup Language access control model with
filtering privacy based on matrix storage. IET Commun. 8, 1919–1927 (2014)
2. Sahai, A., Waters, B.: Fuzzy identity-based encryption. In: Cramer, R. (ed.) EUROCRYPT
2005. LNCS, vol. 3494, pp. 457–473. Springer, Heidelberg (2005)
3. Li, J., Wang, Q., Wang, C., Ren, K.: Enhancing attribute-based encryption with attribute
hierarchy. Mob. Netw. Appl. 16, 553–561 (2011)
4. Wang, G., Liu, Q., Wu, J., Guo, M.: Hierarchical attribute-based encryption and scalable
user revocation for sharing data in cloud servers. Comput. Secur. 30, 320–331 (2011)
5. Wan, Z., Liu, J.E., Deng, R.H.: HASBE: a hierarchical attribute-based solution for flexible
and scalable access control in cloud computing. IEEE Trans. Inf. Forensics Secur. 7,
743–754 (2012)
6. Liu, X., Xia, Y., Jiang, S., Xia, F., Wang, Y.: Hierarchical attribute-based access control with
authentication for outsourced data in cloud computing. In: 2013 12th IEEE International
Conference on Trust, Security and Privacy in Computing and Communications (TrustCom),
pp. 477–484. IEEE (2013)
7. Deng, H., Wu, Q., Qin, B., Domingo-Ferrer, J., Zhang, L., Liu, J., Shi, W.: Ciphertext-policy
hierarchical attribute-based encryption with short ciphertexts. Inf. Sci. 275, 370–384 (2014)
8. Goyal, V., Pandey, O., Sahai, A., Waters, B.: Attribute-based encryption for fine-grained
access control of encrypted data. In: Proceedings of the 13th ACM Conference on Computer
and Communications Security, pp. 89–98. ACM (2006)
9. Bethencourt, J., Sahai, A., Waters, B.: Ciphertext-policy attribute-based encryption. In:
IEEE Symposium on Security and Privacy, 2007, SP 2007, pp. 321–334. IEEE (2007)
10. Cheung, L., Newport, C.: Provably secure ciphertext policy ABE. In: Proceedings of the
14th ACM Conference on Computer and Communications Security, pp. 456–465. ACM
(2007)
11. Goyal, V., Jain, A., Pandey, O., Sahai, A.: Bounded ciphertext policy attribute based
encryption. In: Aceto, L., Damgård, I., Goldberg, L.A., Halldórsson, M.M., Ingólfsdóttir, A.,
Walukiewicz, I. (eds.) ICALP 2008, Part II. LNCS, vol. 5126, pp. 579–591. Springer,
Heidelberg (2008)
12. Kapadia, A., Tsang, P.P., Smith, S.W.: Attribute-based publishing with hidden credentials
and hidden policies. In: NDSS, pp. 179–192 (2007)
13. Boneh, D., Waters, B.: Conjunctive, subset, and range queries on encrypted data. In:
Vadhan, S.P. (ed.) TCC 2007. LNCS, vol. 4392, pp. 535–554. Springer, Heidelberg (2007)
14. Waters, B.: Ciphertext-policy attribute-based encryption: an expressive, efficient, and
provably secure realization. In: Catalano, D., Fazio, N., Gennaro, R., Nicolosi, A. (eds.)
PKC 2011. LNCS, vol. 6571, pp. 53–70. Springer, Heidelberg (2011)
15. Balu, A., Kuppusamy, K.: An expressive and provably secure ciphertext-policy
attribute-based encryption. Inf. Sci. 276, 354–362 (2014)
16. Boneh, D., Franklin, M.: Identity-based encryption from the weil pairing. In: Kilian, J. (ed.)
CRYPTO 2001. LNCS, vol. 2139, pp. 213–229. Springer, Heidelberg (2001)
17. Damgård, I.B., Thorbek, R.: Linear integer secret sharing and distributed exponentiation. In:
Yung, M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 75–90.
Springer, Heidelberg (2006)
Privacy-Preserving Multidimensional
Range Query on Real-Time Data
1 Introduction
Motivation: In some data outsourcing scenarios, the data owner (DO) may store its real-time data, which are periodically submitted by its subordinate data collectors (DCs), on a cloud service provider (CSP) in encrypted form. As a typical example, Fig. 1 shows a radiation detection scenario in which DCs managing sensing devices periodically upload the data they collect to the CSP; the collected data involve three dimensions: the time, the geographic location of the road (denoted by GPS coordinates) and the local radiation level. The Environmental Protection Agency (DO) may sometimes query the CSP to get the radiation levels of roads. There are many other similar scenarios, such as a traffic accident monitoring system in which the Traffic Management Agency often requests pictures of accidents stored on the CSP. In these scenarios, the privacy of the data which the DCs upload to the CSP should be respected. We note that the CSP in these scenarios is semi-trusted and curious: it correctly implements the prescribed operations, but it is curious about the data it stores and may omit some query results.
We aim to realize privacy-preserving multidimensional range queries on real-time data in the above scenarios. Generally, it is necessary to store these sensitive data in encrypted form, with the keys kept secret from the CSP at all times. In the trivial approach, DCs first upload data to DO in real time; DO then encrypts the data, generates the indexes itself, and finally uploads them to the CSP. DO may also sometimes query the data stored on the CSP. Note that a symmetric cryptosystem, such as AES, is used in the trivial approach, and the key remains unchanged.
Since the data that DCs upload to DO are real-time, the trivial approach is very inefficient: DO needs to frequently upload newly collected data to the CSP and frequently update the index structure on the CSP, which leads to a very high cost at DO.
For the scenario mentioned above, we propose a scheme to realize privacy-preserving multidimensional range queries, based on a bucketization scheme, on real-time data for cloud storage. Since we are dealing with real-time data, we divide time into N epochs and adopt an asymmetric key-insulated method to automatically update the key at each epoch. Thus DO only needs to issue keys to the DCs every N epochs and to query the CSP when necessary. DCs directly upload the collected real-time data and indexing information to the CSP. Furthermore, since the CSP is semi-trusted and may submit forged or incomplete query results to DO, it is necessary to guarantee the integrity of the query result.
Our Contributions: For the scenario mentioned above, we propose a scheme to realize privacy-preserving multidimensional range queries, based on a bucketization scheme, on real-time data for cloud storage. To summarize, the contributions of this paper are:
(1) We divide time into N epochs and apply key-insulated technology, which is based on a public-key cryptosystem and supports periodic key update, to the bucketization method. Compared with the trivial approach, there are two advantages: (i) DCs undertake most of the work, including collecting data, generating indexes and uploading data, and thus radically reduce the cost that DO bears in the trivial approach; (ii) key distribution is simple, because the keys of each epoch can be calculated from the keys of the previous epoch, and the cycle for key distribution is N epochs, where N is a parameter of the key-insulated technology.
(2) Our scheme supports integrity verification of the query result: it can verify whether the semi-trusted CSP omits some query results or not, thereby avoiding query-result incompleteness.
2 Related Work
(1) Methods that use specialized data structures [1–3]: the performance of these methods often gets worse as the dimension of the data increases. Shi et al. [1] propose a binary interval tree to represent ranges efficiently and apply it to multidimensional range queries, but their scheme cannot avoid unwanted information leakage because it adapts one-dimensional search techniques to the multidimensional case. A binary string-based encoding of ranges is proposed by Li et al. in [3]; however, since each record in their scheme has to be checked by the server during query processing, the access mechanism of their scheme is inefficient. (2) Order-preserving encryption (OPE) based techniques [4, 5]: the order of plaintext data is preserved in the ciphertext domain, but OPE is susceptible to statistical attacks because the encryption is deterministic (i.e., the encryption of a given plaintext is always identical, so the frequencies of distinct values in the dataset are often revealed). (3) Bucketization-based techniques [6–8]: since they partition and index data by simple distributional properties, they can support efficient queries while keeping information disclosure to a minimum. Hacigumus et al. [6] first proposed the bucketization-based representation for query processing, and afterward many researchers adopted bucketization-based techniques for multidimensional range queries [7–10].
Here, t denotes the epoch of interest (t ∈ [1, N]) and [l_a, l_b] denotes the wanted range of attribute A_j. Moreover, this type of range query can be extended to queries that involve several epochs and the union of ranges over several attributes.
At the end of each epoch, each DC generates an encryption key from its master encryption key to encrypt the data it collected in that epoch. In addition to partitioning the data into buckets by the bucketization-based method, each DC also generates verifying numbers for empty buckets (the buckets that have no data in them), which are used in our integrity verification of the query result. The data, consisting of bucket IDs, encrypted data and verifying numbers, are finally uploaded to the CSP by each DC at each epoch. Detailed data processing in each epoch is provided in Sect. 3.3.
DO sometimes needs to query the data stored on the CSP. DO first translates plaintext queries into bucket IDs and sends them to the CSP. The CSP then returns a set of encrypted results, together with a proof that is generated using the corresponding verifying numbers. Once DO receives the results, it first verifies query-result integrity using the proof. If the verification succeeds, DO updates the decryption key(s) by interacting with the key-update device, decrypts the encrypted results, and then filters them to get the wanted query results. Note that in our scheme, the initial decryption keys are used only once, when DO generates the decryption keys of the first epoch (t = 1), and DO can update to the decryption keys of any desired epoch in "one shot". Detailed query processing is shown in Sect. 3.4.
• We denote by X the set of all buckets, including non-empty and empty buckets, that DC_i generates in epoch t.
• We denote by Y_{i,t} the number of non-empty buckets DC_i generates in epoch t; the non-empty buckets DC_i generates in epoch t are denoted by B_{i,t} = {b_1, b_2, ..., b_{Y_{i,t}}} ⊆ X, j ∈ [1, Y_{i,t}].
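As a concrete illustration of the bucketization idea used throughout this section, the short Python sketch below partitions a one-dimensional attribute domain into equal-width buckets and translates a plaintext range query into the bucket IDs it touches. The bucket width, data values and helper names are our own illustrative assumptions, not part of the scheme.

    # Minimal sketch of equal-width bucketization for one attribute dimension.
    # Bucket width and function names are illustrative assumptions only.

    def bucket_id(value, width=10):
        """Map an attribute value to its bucket ID."""
        return value // width

    def query_buckets(lo, hi, width=10):
        """Translate a plaintext range [lo, hi] into the bucket IDs it overlaps."""
        return set(range(bucket_id(lo, width), bucket_id(hi, width) + 1))

    # Example: radiation readings grouped by bucket, then queried by range.
    readings = [3, 17, 25, 42, 58]
    buckets = {}
    for v in readings:
        buckets.setdefault(bucket_id(v), []).append(v)

    wanted = query_buckets(15, 45)          # {1, 2, 3, 4}
    result = [v for b in wanted for v in buckets.get(b, [])]
    print(sorted(result))                   # [17, 25, 42]; bucket granularity may also return false positives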
At the end of each epoch, each DC generates the corresponding encryption key for the epoch from its master encryption key and uses it to encrypt all the non-empty buckets (the buckets that have data in them). Let's consider the encryption work that DC_i does at the end of epoch t. Given the master encryption key PK_i = (g, h, z_0, ..., z_m) and the epoch value t, DC_i does the following:

Step 1. It computes PK_{i,t} = ∏_{l=0}^{m} z_l^{t^l};
Step 2. It selects r ∈ Z_q at random and computes the ciphertext of the data D_j falling into the bucket b_j: {D_j}_{PK_{i,t}} = (g^r, h^r, PK_{i,t}^r · D_j).

Although the above encryption method can ensure data confidentiality (as the CSP does not know the decryption key), the CSP may still omit some data which satisfy the query, leading to query-result incompleteness. Our solution is that DC_i generates a verifying number num(b_k, i, t) for each empty bucket b_k ∈ X \ B_{i,t} at the end of epoch t: num(b_k, i, t) = h_a(i ∥ t ∥ b_k ∥ PK_{i,t}), where h_a(·) denotes a hash function with an a-bit output. Finally, DC_i uploads to the CSP all encrypted non-empty buckets and verifying numbers with their respective bucket IDs as follows:

  {(i, t, b_j, {D_j}_{PK_{i,t}}) | b_j ∈ B_{i,t}} ∪ {(b_k, num(b_k, i, t)) | b_k ∈ X \ B_{i,t}}.
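The per-epoch work of one DC described above can be sketched directly in Python. This is only an illustration of the structure of Steps 1–2 and of the verifying numbers: the toy modulus, generators and hash truncation are our own assumptions, far too small for real use, and a real deployment would use a proper prime-order group.

    # Sketch of one DC's per-epoch work: epoch key, bucket encryption, verifying number.
    import hashlib, random

    p = 0xFFFFFFFFFFFFFFC5          # toy prime modulus (assumption; far too small for real use)
    g, h = 3, 5                     # toy generators (assumption)
    m = 2
    z = [pow(g, random.randrange(2, p - 1), p) for _ in range(m + 1)]   # public key part z_0..z_m

    def epoch_key(t):
        """PK_{i,t} = prod_{l=0..m} z_l^(t^l) mod p."""
        pk = 1
        for l in range(m + 1):
            pk = pk * pow(z[l], t ** l, p) % p
        return pk

    def encrypt_bucket(data, t):
        """Ciphertext (g^r, h^r, PK_{i,t}^r * data) for data encoded as a group element."""
        r = random.randrange(2, p - 1)
        pk_t = epoch_key(t)
        return pow(g, r, p), pow(h, r, p), pow(pk_t, r, p) * data % p

    def verifying_number(i, t, bucket_id):
        """num(b_k, i, t) = hash(i || t || b_k || PK_{i,t}), truncated to a bits."""
        msg = f"{i}|{t}|{bucket_id}|{epoch_key(t)}".encode()
        return hashlib.sha256(msg).hexdigest()[:8]    # 32-bit 'a' chosen here as an assumption

    ct = encrypt_bucket(data=42, t=7)                 # non-empty bucket
    tag = verifying_number(i=1, t=7, bucket_id=13)    # empty bucket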
First, the key-update device generates a partial decryption key: it takes as input the epoch value t′ and the master key SK_i = (x_1, y_1, ..., x_m, y_m) and outputs the partial decryption key SK′_{i,t′} = (x′_{t′}, y′_{t′}) by computing x′_{t′} = Σ_{l=1}^{m} x_l ((t′)^l − t^l) and y′_{t′} = Σ_{l=1}^{m} y_l ((t′)^l − t^l). We assume that the exposure of a partial decryption key occurs less often than the exposure of a decryption key.
Second, DO uses the partial decryption key SK′_{i,t′} to generate the decryption key SK_{i,t′} via the DO key-update algorithm DOKU(t′, SK_{i,t}, SK′_{i,t′}) → SK_{i,t′}, which takes as input the epoch value t′, the decryption key SK_{i,t} = (x_t, y_t) of epoch t and the partial decryption key SK′_{i,t′} = (x′_{t′}, y′_{t′}), and outputs the decryption key SK_{i,t′} = (x_{t′}, y_{t′}) by computing x_{t′} = x_t + x′_{t′} and y_{t′} = y_t + y′_{t′}.
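The update arithmetic above can be checked with a few lines of Python. This is only a numeric illustration of the two-step update (device partial key, then DO addition); the coefficients and epoch values are arbitrary assumptions.

    # Toy illustration of the (m, N)-key-insulated update arithmetic described above.
    import random

    m = 3
    x = [random.randrange(1, 10**6) for _ in range(m + 1)]   # coefficients x_0, ..., x_m

    def epoch_key(t):
        """Full epoch key x_t = sum_{l=0..m} x_l * t^l (what DO should hold at epoch t)."""
        return sum(x[l] * t ** l for l in range(m + 1))

    def partial_key(t_new, t_old):
        """Device output x'_{t'} = sum_{l=1..m} x_l * ((t')^l - t^l), as in the text."""
        return sum(x[l] * (t_new ** l - t_old ** l) for l in range(1, m + 1))

    t_old, t_new = 4, 9
    sk_old = epoch_key(t_old)
    sk_new = sk_old + partial_key(t_new, t_old)     # DOKU: x_{t'} = x_t + x'_{t'}
    assert sk_new == epoch_key(t_new)               # x_0 carries over from sk_old; only the l >= 1 terms change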
Here h_b(·) is a hash function with a b-bit output. Afterwards, the CSP returns the query results as follows:

  {(i, t, b_j, {D_j}_{PK_{i,t}}) | b_j ∈ Q_t ∩ B_{i,t}} ∪ {(b_k, NUM_{Q_t}) | b_k ∈ Q_t ∩ (X \ B_{i,t})}, i ∈ [1, n], t ∈ [1, N].

Since DO knows PK_i, it can compute all the corresponding verifying numbers and then compute a NUM′_{Q_t}. If NUM′_{Q_t} = NUM_{Q_t}, DO considers that the CSP did not omit query results (otherwise it did), and then uses the corresponding decryption key(s) to decrypt the results by computing D_{i,j} = {D_j}_{PK_{i,t}} / (g^{r x_t} h^{r y_t}). We note that DO can query data that are uploaded by several DCs in several epochs.
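The decryption equation above suggests that PK_{i,t} = g^{x_t} h^{y_t} with x_t = Σ_l x_l t^l and y_t = Σ_l y_l t^l (i.e., z_l = g^{x_l} h^{y_l}); the Python sketch below checks DO's result verification and decryption under that assumption, comparing a single verifying number for illustration since the exact aggregation into NUM_{Q_t} is not shown in this excerpt. All parameters are toy assumptions.

    # Sketch of DO-side result checking and decryption (toy group, illustrative only).
    import hashlib, random

    p = 0xFFFFFFFFFFFFFFC5                      # toy prime modulus (assumption)
    g, h, m, t = 3, 5, 2, 7
    x = [random.randrange(2, p - 1) for _ in range(m + 1)]   # master decryption key part (x_l)
    y = [random.randrange(2, p - 1) for _ in range(m + 1)]   # master decryption key part (y_l)
    z = [pow(g, x[l], p) * pow(h, y[l], p) % p for l in range(m + 1)]   # assumed z_l = g^{x_l} h^{y_l}

    x_t = sum(x[l] * t ** l for l in range(m + 1))
    y_t = sum(y[l] * t ** l for l in range(m + 1))
    pk_t = 1
    for l in range(m + 1):
        pk_t = pk_t * pow(z[l], t ** l, p) % p   # PK_{i,t}

    # Ciphertext (c1, c2, c3) for a non-empty bucket, as uploaded by the DC.
    r, data = random.randrange(2, p - 1), 42
    c1, c2, c3 = pow(g, r, p), pow(h, r, p), pow(pk_t, r, p) * data % p

    # Proof check: recompute the verifying number of a queried empty bucket.
    def verifying_number(i, t, bucket_id):
        return hashlib.sha256(f"{i}|{t}|{bucket_id}|{pk_t}".encode()).hexdigest()[:8]

    proof_from_csp = verifying_number(1, t, 13)
    assert proof_from_csp == verifying_number(1, t, 13)      # NUM'_{Q_t} matches: no omission detected

    # D_j = c3 / (c1^{x_t} * c2^{y_t}) -- matches the decryption equation in the text.
    denom = pow(c1, x_t, p) * pow(c2, y_t, p) % p
    recovered = c3 * pow(denom, -1, p) % p
    assert recovered == data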
However, the query results always contain some superfluous data items (false positives) that DO does not really want, as the ranges of interest may not exactly span full buckets. Using finer buckets can reduce such false positives, but it brings the problem that the data distribution may be estimated more accurately by an adversary, thus increasing the risk of information disclosure in bucketization. For this problem, we can apply the maximum entropy principle to each bucket to make the distribution of sensitive attributes in each bucket uniform, thereby minimizing the risk of disclosure. We can also refer to [15, 16] for optimal bucketing strategies which achieve a good balance between reducing false positives and reducing the risk of information disclosure.
4 Security Analysis
4.1 Analysis on Integrity Verification
As we discussed above, the CSP may omit some query results, leading to incompleteness of the query. Let's consider the probability that this misbehavior of the CSP can be detected. We assume that each non-empty bucket is queried with probability c and omitted with probability d, so the total number of omitted buckets generated by all DCs is acd. To escape detection by DO, the CSP must return a correct NUM′_{Q_t} corresponding to the incomplete query results, and the probability of guessing a correct NUM′_{Q_t} is 2^{−a}. So the probability that the misbehavior of the CSP is detected is:

  P_det = 1 − 2^{−a·acd}.

The misbehavior of the CSP is detected with higher probability as a·acd grows.
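A quick numeric reading of this bound, under our assumption that the verifying numbers are a-bit values, shows that even a single omitted bucket is detected except with probability 2^{−a}; the values below are illustrative only.

    # Numeric illustration of P_det = 1 - 2^(-a * acd); a and the bucket counts are assumptions.
    a = 32                              # bits of the verifying-number hash
    for omitted in (1, 5, 20):          # acd: number of omitted buckets
        p_det = 1 - 2 ** (-a * omitted)
        print(omitted, p_det)           # already about 1 - 2.3e-10 for a single omitted bucket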
into the insecure storage of DO in the key-updating step (i.e., between epoch t−1 and epoch t) to get SK_{i,t−1} and SK′_{i,t} so as to generate SK_{i,t}; (3) Master key exposure: it occurs when the adversary compromises the key-update device (i.e., leakage of SK_i).
5 Experiment
To evaluate the performance of our scheme, we compare it with the trivial approach described in Sect. 1. We assume that DO in the trivial approach adopts the AES-128 algorithm and the same bucketization-based methods as in our scheme. In both approaches, time is divided into epochs. The cost of generating bucket IDs in both schemes can be ignored due to its simplicity. Let k be the total number of buckets in each epoch and λ be the percentage of non-empty buckets among all buckets. We denote the probability of each bucket being queried by c. Since the duration of each epoch is not long and the data collected in each epoch and each bucket will not be much, we suppose for simplicity that the size of the data in each non-empty bucket in each epoch is 4096 bits.
Primitive Operations: The main (primitive) operations used in our scheme include:
(1) modular exponentiation;
(2) modular multiplication;
(3) SHA-256 computation of 1024-bit integers.
The time needed for these operations was benchmarked in [18] as follows:
(1) the average time for computing the power of a 1024-bit number to a 270-bit exponent followed by the modular reduction was found to be t_1 = 1.5 ms;
(2) a modular multiplication of 270-bit numbers was found to take t_2 = 0.00016 ms;
(3) the average time to compute the 256-bit digest of a 1024-bit number was found to be t_3 = 0.01 ms.
For efficiency, we can adopt the AES-128 algorithm to encrypt the data in buckets and then use the keys generated by the (m, N)-key-insulated method to encrypt the symmetric keys in our scheme. Since the AES-128 algorithm is efficient in both the trivial approach and our scheme, and the amount of data encrypted with AES-128 is the same, we do not consider the cost of AES-128 when comparing the performance of the trivial approach and our scheme.
Next, we compare the cost of the trivial approach and our scheme, including the cost of system setup, the data processing work in each epoch, and query processing. t″ denotes the number of queried epochs.
Table 1. The comparison between the trivial approach and our scheme in each epoch

                    | Operation                    | DO in trivial approach | DO in our scheme | DC in our scheme
In system setup     | Generating keys              | O(1)                   | O(mn)            | \
In each epoch       | Encryption                   | O(λk)                  | \                | O(k/n)
                    | Key update                   | \                      | \                | O(m)
                    | Generating verifying numbers | \                      | \                | O(k(1 − λ)/n)
                    | Communication cost           | O(k)                   | \                | O(k/n)
In query processing | Key update                   | \                      | O(mt″)           | \
                    | Decryption                   | O(ck)                  | O(ck)            | \
                    | Verification                 | \                      | O(k(1 − λ)c/n)   | \
The key generation cost in our scheme is acceptable for the following two reasons: (i) the cost is linearly related to m, which can be very small even for high security requirements; (ii) DO only needs to generate these keys once every N epochs.
Data Processing in Each Epoch. DO in the trivial approach needs to encrypt and generate bucket IDs for the data uploaded by all DCs, and it also needs to frequently upload these encrypted data and bucket IDs to the CSP. The communication cost of DO would therefore be high for real-time data. In our scheme, however, each DC undertakes this work for its own data and DO does not need to do anything for data processing. Additionally, each DC in our scheme needs to generate its encryption key for the current epoch; the time for this key generation is 2(m + 1)t_1 + mt_2, which is acceptable for the small values of m. Each DC in our scheme also generates verifying numbers for empty buckets; this improves security against the semi-trusted CSP, which the trivial approach does not consider. The average time for each DC in our scheme to encrypt λk/n non-empty data buckets and generate verifying numbers for (1 − λ)k/n empty buckets in each epoch is (3t_1 + t_2)λk/n + (1 − λ)k t_3/n. It is worthwhile for DCs in our scheme to spend additional time on verifying-number generation and encryption-key update because this also improves the security of the system. The total time of data processing in each epoch for each DC is the sum of these terms.
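Plugging the benchmark timings quoted above into this per-epoch cost gives a concrete feel for its magnitude; the parameter values below (k, λ, n, m) are illustrative assumptions in the spirit of Fig. 3.

    # Estimate of one DC's per-epoch cost (key update + encryption + verifying numbers).
    t1, t2, t3 = 1.5, 0.00016, 0.01      # ms, from the benchmarks quoted above
    m, n = 5, 100
    k, lam = 2000, 0.7                   # total buckets per epoch, fraction non-empty (assumed)

    key_update  = 2 * (m + 1) * t1 + m * t2
    enc_and_ver = (3 * t1 + t2) * lam * k / n + (1 - lam) * k * t3 / n
    print(key_update + enc_and_ver, "ms per DC per epoch")   # roughly 81 ms with these numbers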
The results are shown in Fig. 3. There are four lines denoting the time for the data upload work with k = 1000, k = 2000, k = 3000 and k = 4000, respectively. The time increases as k increases. The experimental results show that the cost of the data processing work in each epoch for each DC is acceptable.
Fig. 3. The time for data processing work in each epoch for each DC with different numbers of buckets (n = 100, m = 5).
These costs only occur when DO wants to query the data. Let's first consider the case in which DO queries the data uploaded by one DC in one epoch. The query results contain λkc/n non-empty buckets, and the time to decrypt the query results is 2λck(t_1 + t_2)/n. Moreover, the time for key update is 4m(t_1 + t_2), and the time for computing NUM_{Q_t} is (1 − λ)ckt_3/n. The total time is the sum of these three terms.
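The same style of back-of-the-envelope estimate can be made for DO's query cost over one DC and one epoch, again using assumed parameter values consistent with the experiments.

    # Estimate of DO's query-processing cost (decryption + key update + NUM computation).
    t1, t2, t3 = 1.5, 0.00016, 0.01      # ms
    m, n, k, lam, c = 5, 100, 2000, 0.7, 0.3

    decrypt    = 2 * lam * c * k * (t1 + t2) / n
    key_update = 4 * m * (t1 + t2)
    verify     = (1 - lam) * c * k * t3 / n
    print(decrypt + key_update + verify, "ms")   # about 43 ms with these numbers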
The results are shown in Fig. 4. There are four lines denoting the time for querying the data submitted by DC_i in one epoch with k = 1000, k = 2000, k = 3000 and k = 4000, respectively. Obviously, the query time increases with the number of buckets. In actual scenarios, such as radiation detection and traffic monitoring, DO often queries data collected by sensing devices in a specific area and in a limited number of epochs. Therefore, the cost of query processing in our scheme is acceptable.
Table 1 shows the comparison between our scheme and the trivial approach. Note that the value of m is small in practical applications. In addition, the cost of DO in our scheme is radically reduced because the data processing work is shared by n DCs. Additionally, since we adopt the (m, N)-key-insulated method to update keys in each epoch, the data in our scheme are more secure than the data in the trivial approach, which are encrypted with an unchanged key for N epochs. Furthermore, our integrity verification method can verify whether the semi-trusted CSP omits query results or not. The additional cost of key update and integrity verification is acceptable.
Fig. 4. The time for query processing with different numbers of buckets (n = 100, m = 5, c = 0.3).
6 Conclusion
In this paper, we are the first to construct a scheme realizing privacy-preserving multidimensional range queries on real-time data. We apply the (m, N)-key-insulated method to the bucketization method and radically reduce the cost of DO. In our scheme, DO does not need to do anything for data processing in each epoch, and it only executes queries when it wants to. Each DC undertakes the work of updating encryption keys, encrypting its data and generating verifying numbers in each epoch, and the experiments show that the cost for each DC to do this work is acceptable in practice. By using the (m, N)-key-insulated method, which is semantically secure under the DDH assumption, we improve the security of the data (an adversary who compromises at most m epochs can only destroy the security of the data in the compromised epochs) and also simplify the key distribution of DO. Furthermore, we can verify whether the semi-trusted CSP omits some query results or not and thereby ensure query-result integrity. Because of space constraints, we leave further research on optimal methods for reducing false positives and the risk of information disclosure for the future.
Acknowledgements. Our work is sponsored by the National Natural Science Foundation of China (research on privacy-protecting ciphertext query algorithms in cloud storage, No. 61472064), the Science and Technology Foundation of Sichuan Province (research and application demonstration of a trusted and safety-controllable privacy-protecting service architecture for cloud data, 2015GZ0095) and the Fundamental Research Funds for the Central Universities (research on some key technologies in cloud storage security, YGX2013J072).
References
1. Shi, E., Bethencourt, J., Chan, H.T.-H., Song, D.X., Perrig, A.: Multi-dimensional range
query over encrypted data. In: IEEE S&P (2007)
2. Boneh, D., Waters, B.: Conjunctive, subset, and range queries on encrypted data. In:
Vadhan, S.P. (ed.) TCC 2007. LNCS, vol. 4392, pp. 535–554. Springer, Heidelberg (2007)
3. Li, J., Omiecinski, E.R.: Efficiency and security trade-off in supporting range queries on
encrypted databases. In: Jajodia, S., Wijesekera, D. (eds.) Data and Applications Security
2005. LNCS, vol. 3654, pp. 69–83. Springer, Heidelberg (2005)
4. Agrawal, R., Kiernan, J., Srikant, R., Xu, Y.: Order-preserving encryption for numeric data.
In: Proceedings of the 2004 ACM SIGMOD International Conference on Management of
Data, pp. 563–574. ACM (2004)
5. Boldyreva, A., Chenette, N., Lee, Y., O’Neill, A.: Order-preserving symmetric encryption. In:
Joux, A. (ed.) EUROCRYPT 2009. LNCS, vol. 5479, pp. 224–241. Springer, Heidelberg
(2009)
6. Hacigümüş, H., Iyer, B., Li, C., Mehrotra, S.: Executing SQL over encrypted data in the
database-service-provider model. In: Proceedings of the 2002 ACM SIGMOD International
Conference on Management of Data, pp. 216–227. ACM (2002)
7. Hore, B., Mehrotra, S., Tsudik, G.: A privacy-preserving index for range queries. In:
Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol. 30,
pp. 720–731. VLDB Endowment (2004)
8. Hore, B., Mehrotra, S., Canim, M., Kantarcioglu, M.: Secure multidimensional range queries
over outsourced data. Int. J. Very Large Data Bases 21, 333–358 (2012)
9. Girault, M.: Relaxing tamper-resistance requirements for smart cards by using (auto-) proxy
signatures. In: Quisquater, J.-J., Schneier, B. (eds.) CARDIS 1998. LNCS, vol. 1820,
pp. 157–166. Springer, Heidelberg (2000)
10. Dodis, Y., Katz, J., Xu, S., Yung, M.: Key-insulated public key cryptosystems. In: Knudsen,
L.R. (ed.) EUROCRYPT 2002. LNCS, vol. 2332, pp. 65–82. Springer, Heidelberg (2002)
11. Tzeng, W.-G., Tzeng, Z.-J.: Robust key-evolving public key encryption schemes. In: Deng,
R.H., Qing, S., Bao, F., Zhou, J. (eds.) ICICS 2002. LNCS, vol. 2513, pp. 61–72. Springer,
Heidelberg (2002)
12. Lu, C.-F., Shieh, S.-P.: Secure key-evolving protocols for discrete logarithm schemes. In:
Preneel, B. (ed.) CT-RSA 2002. LNCS, vol. 2271, pp. 300–309. Springer, Heidelberg (2002)
13. Hanaoka, G., Hanaoka, Y., Imai, H.: Parallel key-insulated public key encryption. In: Yung,
M., Dodis, Y., Kiayias, A., Malkin, T. (eds.) PKC 2006. LNCS, vol. 3958, pp. 105–122.
Springer, Heidelberg (2006)
14. Zhang, R., Shi, J., Zhang, Y.: Secure multidimensional range queries in sensor networks. In:
Proceedings of the Tenth ACM International Symposium on Mobile Ad Hoc Networking
and Computing, pp. 197–206. ACM (2009)
15. Phan Van Song, Y.-L.: Query-optimal-bucketization and controlled-diffusion algorithms for
privacy in outsourced databases. Project report, CS5322 Databases Security-2009/2010
16. Hore, B., Mehrotra, S., Tsudik, G.: A privacy-preserving index for range queries. In:
Proceedings of the Thirtieth International Conference on Very Large Data Bases, vol. 30,
pp. 720–731. VLDB Endowment (2004)
17. Bellare, M., Desai, A., Jokipii, E., Rogaway, P.: A concrete security treatment of symmetric
encryption. In: Foundations of Computer Science, pp. 394–403 (1997)
18. Papamanthou, C., Tamassia, R., Triandopoulos, N.: Authenticated hash tables. In:
Proceedings of the 15th ACM Conference on Computer and Communications Security,
pp. 437–448. ACM (2008)
ARM-Based Privacy Preserving
for Medical Data Publishing
Abstract. The increasing use of electronic medical records (EMR) makes medical data mining a hot topic, and consequently medical privacy invasion attracts people's attention. Among these issues, we are particularly interested in privacy preserving for association rule mining (ARM). In this paper, we improve the traditional reconstruction-based privacy-preserving data mining (PPDM), propose a new architecture for medical data publishing with privacy preserving, and present a sanitization algorithm for hiding sensitive rules. In this architecture, the sensitive rules are strictly controlled and the side effects are minimized. Finally, we perform experiments to evaluate the proposed architecture.
1 Introduction
As the size of healthcare data increases dramatically, more and more medical information systems are using digital technology to store this big data. The electronic form of healthcare data is called the EMR (electronic medical record). The stored records are a precious asset: these medical information resources are very valuable for disease treatment, diagnosis and medical research. Data mining, which aims to find the valuable knowledge hidden in these massive medical data resources, has become a very important research topic. One typical application of data mining is association rule mining (ARM): ARM finds a rule set such that the support and confidence of each rule exceed given thresholds.
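Since ARM is characterized here by support and confidence thresholds, the short Python sketch below shows how both quantities are computed for a candidate rule over a transaction dataset; the toy transactions and threshold values are our own illustrative assumptions.

    # Computing support and confidence of a candidate rule X => Y over transactions.
    transactions = [{1, 2, 3}, {1, 2}, {2, 3, 4}, {1, 2, 4}, {2, 5}]

    def support(itemset):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs):
        return support(lhs | rhs) / support(lhs)

    lhs, rhs = {1, 2}, {3}
    min_sup, min_conf = 0.2, 0.3
    rule_holds = support(lhs | rhs) >= min_sup and confidence(lhs, rhs) >= min_conf
    print(support(lhs | rhs), confidence(lhs, rhs), rule_holds)   # 0.2, ~0.33, True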
Although a great deal of valuable knowledge is discovered by ARM, the mined rules may include private information about patients and medical departments, and people have shown increasing concern about the privacy violations brought by the technology. Because person-specific information is contained in medical systems, publishing the data may cause unintended privacy leakage, which may harm the victims. Typically, privacy information can be classified into two categories [1]: one is information that can be obtained directly from the original data, which can be protected by methodologies like perturbation, sampling, generalization/suppression, transformation,
2 Related Works
Since the concept of PPDM was first proposed in 1999 by Agrawal [2], there have been a great number of achievements in this field [3–5]. Data mining is an interdisciplinary technology that includes many data analysis techniques, such as statistics and machine learning; hence the diversity of privacy-preserving techniques for it.
Lei Chen identified the incompatibilities between traditional PPDM and typical free-text Chinese EMRs and proposed a series of new algorithms to solve the problem in [6]. He then designed a new framework of privacy preserving for healthcare data publishing based on this method in [7]. There are also some achievements in medical data protection in [8, 9].
Typically, privacy-preserving technology in the field of data mining can be divided into two categories. One kind is to protect the sensitive data itself, such as names and ID numbers. For medical data, the Health Insurance Portability and Accountability Act (HIPAA) was enacted in the US in 1996 [10]; the Act announced personal health information privacy standards and guidelines for implementation. There are many kinds of technologies to protect such data, the most common being anonymization, including k-anonymity [11], l-diversity [12] and t-closeness [13]. In [14], Aristides Gionis and Tamir Tassa extended the framework of k-anonymity to include any type of generalization operators and defined three measures to quantify the loss of information more accurately. They also proved that the problem of k-anonymity with minimal loss of information is NP-hard, and then proceeded to describe an approximation algorithm with an approximation guarantee of O(ln k).
The other category is to protect the sensitive data mining results produced in the process of data mining [15]. Generally, these techniques focus on the improvement of data mining algorithms. Among them, we are particularly interested in approaches proposed to perform association rule hiding [16, 17]. In [18], a new algorithm for hiding sensitive rules using distortion techniques is proposed by Jain et al.; the confidence of the sensitive rules is reduced by altering the positions of the items. In [19], Le et al. proposed a heuristic algorithm to identify the items which have the least impact on the mining results and remove these items from the original dataset. There are also many achievements in [20–22]. In [23], Verykios proposed three groups of algorithms to hide sensitive rules based on reducing their support and confidence values, which can be seen as precursors to the algorithm we propose in this paper.
As shown in Fig. 1, there are three types of participants in our architecture: the data owner, the data collector and the data user. The data owner provides the original medical dataset together with the privacy definition and policy; the data collector runs ARM and the sanitization algorithm and publishes the new dataset to the data user. We assume that the data collector and the network communications are reliable.
Fig. 1. ARM-based system architecture of privacy preserving for medical data publishing
The process of the architecture is shown in Fig. 2. The data owner provides the original dataset D and the privacy definition and policy. The data collector is the main part of the procedure: it transforms the original dataset D into a "clean" dataset D′ by applying a series of privacy-preserving algorithms, and finally publishes D′ to the data user.
The procedure of the framework is as follows:
step 1: The data owner provides the original dataset D = {T_1, T_2, T_3, ..., T_m} and the privacy definition and policy to the data collector.
step 2: The data collector uses an ARM algorithm to generate the association rule set R.
step 3: The data collector identifies the corresponding sensitive association rules R_h for different data users according to the privacy definition and policy.
step 4: The data collector performs the sanitization algorithm to hide the rules in R_h from the dataset.
step 5: Steps 2 to 5 are repeated until no sensitive rules can be found in the mining results; the remaining dataset is "clean".
step 6: The new dataset D′, with no sensitive patterns, is published to the data user.
As depicted in Fig. 2, we do not use a reconstruction algorithm after sanitization as the traditional framework does, because it has been proven that the relationship between transactions and their k-itemsets is a one-to-one correspondence [24]. That is to say, whether we modify dataset D directly during the sanitization algorithm or clean the sensitive rules from the k-itemsets and then reconstruct a new dataset D′ from the cleaned k-itemsets, we obtain the same result. To improve the efficiency of our architecture, we hide rules directly in dataset D.
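The iterative "mine, identify, sanitize, repeat" loop described above can be sketched in a few lines of Python. The distortion strategy used here (dropping the consequent item from supporting transactions until the rule falls below the thresholds) is one simple choice in the spirit of support/confidence reduction [23], not the paper's exact algorithm; the data and thresholds are assumptions.

    # Minimal sketch of the sanitization loop: hide sensitive rules by direct modification of D.
    MIN_SUP, MIN_CONF = 0.3, 0.6

    def support(itemset, D):
        return sum(itemset <= t for t in D) / len(D)

    def confidence(lhs, rhs, D):
        s = support(lhs, D)
        return support(lhs | rhs, D) / s if s else 0.0

    def sanitize(D, sensitive_rules):
        changed = True
        while changed:                                   # repeat until no sensitive rule is mined
            changed = False
            for lhs, rhs in sensitive_rules:
                while (support(lhs | rhs, D) >= MIN_SUP and
                       confidence(lhs, rhs, D) >= MIN_CONF):
                    victim = next(t for t in D if (lhs | rhs) <= t)
                    victim -= rhs                        # drop the consequent item(s) from one transaction
                    changed = True
        return D

    D = [{1, 2, 3}, {1, 2, 3}, {1, 2}, {2, 3}, {1, 3}]
    Rh = [({1, 2}, {3})]                                 # sensitive rule 1,2 => 3
    D_clean = sanitize(D, Rh)
    assert not (support({1, 2, 3}, D_clean) >= MIN_SUP and
                confidence({1, 2}, {3}, D_clean) >= MIN_CONF)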
After the labeling, an item can be transformed into a transaction format as shown in Table 3. The first column represents the transaction ID, namely the patient number; the second column represents all the items recorded in his or her EMR. The set of items is I = {1, 2, 3, 4, 5, 6, ...}. Our medical datasets can thus be transformed into a transaction format D = {T_1, T_2, T_3, ..., T_m}. After the labeling, our framework can be easily applied to the medical data.
In our architecture, the sanitization algorithm is repeated until no sensitive patterns exist in the discovered rules. This ensures that no unexpected information from R_h will be published to an unreliable party.
5 Performance Evaluation
In this part, we run our framework on a computer running the Windows Server 2008 R2 operating system. The dataset we use is generated by ourselves, as shown in Table 4. For a more realistic simulation, we make the length of a transaction range from 2 to 7. All the experiments are implemented in MATLAB R2012b.
We use three criteria to evaluate the performance of our framework: (1) the time required by the sanitization algorithm, (2) the number of lost rules, and (3) the number of new rules. Lost rules are the non-sensitive rules that can be mined before the sanitization algorithm but not after it. New rules are the non-sensitive rules that cannot be mined before the sanitization algorithm but are introduced into the mining result after it. Rule hiding has some side effects on the mining result; the numbers of lost rules and new rules are used to evaluate the side effects of our sanitization algorithm.
[5,14,22 => 23] and [5,14,23 => 22] are very likely also sensitive rules. Consequently, the side effect of lost rules is minimized in our architecture.
6 Conclusion
Finally, we evaluate the algorithm on seven datasets, using three criteria to measure its performance: (1) the time required by the process, (2) the number of lost rules, and (3) the number of new rules. Lost rules are the rules that can be mined before the algorithm but not after it; new rules are the rules that cannot be mined before the algorithm but are introduced into the mining result after it. We believe that the proposed architecture can satisfy the demands of the health department for medical data publishing.
Our future plan is to work on privacy measurement issues. We hope to develop different hiding strategies and parameters according to different privacy metrics, in order to adapt to different data users. Moreover, we hope to use actual medical datasets to perform the experiments in order to obtain more realistic mining results.
References
1. Malik, M.B., Ghazi, M.A., Ali, R.: Privacy preserving data mining techniques: current
scenario and future prospects. In: 3rd IEEE International Conference on Computer
Communication Technology (ICCCT), pp. 26–32 (2012)
2. Agrawal, R., Srikant, R.: Privacy preserving data mining. In: Proceedings of
ACM SIGMOD Conference, pp. 439–450 (2000)
3. Fung, B.C.M., Wang, K., Chen, R., Yu, P.S.: Privacy preserving data publishing: a survey of
recent developments, ACM Comput. Surv. 42(4), art. id 14 (2010)
4. Xu, L., Jiang, C.: Information security in big data: privacy and data mining. IEEE 2(10)
(2014)
5. Matwin, S.: Privacy preserving data mining techniques: survey and challenges. In: Custers,
B., Calders, T., Schermer, B., Zarsky, T. (eds.) Discrimination and Privacy in the
Information Society, pp. 209–221. Springer, Berlin (2013)
6. Chen, L., Yang, J.: Privacy-preserving data publishing for free text chinese electronic
medical records. In: IEEE 35th International Conference on Computer Software and
Applications, pp. 567–572 (2012)
7. Chen, L., Yang, J.: A framework for privacy-preserving healthcare data sharing. In: IEEE
14th International Conference on e-Healthcare Networking, Applications and Services,
pp. 341–346 (2012)
8. Hossain, A.A., Ferdous, S.M.S.: Rapid cloud data processing with healthcare information
protection. In: IEEE 10th World Congress on Services, pp. 454–455 (2014)
9. Alabdulatif, A., Khalil, I.: Protection of electronic health records (EHRs) in cloud. In: 35th
Annual International Conference of the IEEE EMBS Osaka, Japan, pp. 4191–4194 (2013)
10. HIPAA - General Information. http://www.cms.gov/HIPPAGenInfo/
11. Sweeney, L.: K-anonymity: a model for protecting privacy. Int. J. Uncertainty Fuzziness
Knowl.-Based Syst. 10(5), 557–570 (2002)
12. Machanavajjhala, A., Gehrke, J., Kifer, D., Venkitasubramaniam, M.: l-diversity: Privacy
Beyond k-anonymity. In: International Conference on Data Engineering (ICDE), pp. 24–35.
IEEE Computer Society, Atlanta (2006)
13. Li, N.H., Li, T.C., Venkatasubramanian, S.: t-closeness: privacy beyond k-anonymity and
l-diversity. In: 23rd IEEE International Conference on Data Engineering (ICDE), pp. 106–115.
IEEE Computer Society, Istanbul (2007)
14. Gionis, A., Tassa, T.: k-anonymization with minimal loss of information. IEEE Trans.
Knowl. Data Eng. 21(2), 206–219 (2009)
15. Verykios, V.S., Bertino, E., Fovino, I.N., Provenza, L.P., Saygin, Y.: State of the art in
privacy preserving data mining. ACM SIGMOD Rec. 33(1), 50–57 (2004)
16. Sathiyapriya, K., Sadasivam, G.S.: A survey on privacy preserving association rule mining.
Int. J. Data Mining Knowl. Manage. Process 3(2), 119 (2013)
17. Zhu, J.M., Zhang, N., Li, Z.Y.: A new privacy preserving association rule mining algorithm
based on hybrid partial hiding strategy. Cybern. Inf. Technol. 13, 41–50 (2013)
18. Jain, D., Khatri, P., Soni, R., Chaurasia, B.K.: Hiding sensitive association rules without
altering the support of sensitive item(s). In: Meghanathan, N., Chaki, N., Nagamalai, D.
(eds.) CCSIT 2012, Part I. LNICST, vol. 84, pp. 500–509. Springer, Heidelberg (2012)
19. Le, H.Q., Arch-Int, S., Nguyen, H.X., Arch-Int, N.: Association rule hiding in risk
management for retail supply chain collaboration. Comput. Ind. 64(7), 776–784 (2013)
20. Dehkordi, M.N.: A novel association rule hiding approach in OLAP data cubes.
Indian J. Sci. Technol. 6(2), 4063–4075 (2013)
21. Bonam, J., Reddy, A.R., Kalyani, G.: Privacy preserving in association rule mining by data
distortion using PSO. In: Satapathy, S.C., Avadhani, P.S., Udgata, S.K., Lakshminarayana,
S. (eds.) Proceedings of the ICT Critical Infrastructure, Proceedings of 48th Annual
Convention Computer Society India, vol. 2, pp. 551–558. Springer (2014)
22. Radadiya, N.R., Prajapati, N.B., Shah, K.H.: Privacy preserving in association rule mining.
Int. J. Adv. Innovative Res. 2(4), 203–213 (2013)
23. Verykios, V.S.: Association rule hiding methods. Wiley Interdiscipl. Rev. Data Mining
Knowl. Discovery 3(1), 28–36 (2013)
24. Chen, X., Orlowska, M., Li, X.: A new framework of privacy preserving data sharing. In:
Proceedings of the 4th IEEE ICDM Workshop: Privacy and Security Aspects of Data
Mining, pp. 47–56. IEEE Computer Society (2004)
Attribute-Based Encryption Without
Key Escrow
1 Introduction
Do you think that your data stored in online cloud storage are secure? Although cloud storage service providers, such as Dropbox, Google, Microsoft and so on, announce that they provide security mechanisms for protecting their systems, what about the cloud storage service providers themselves? It is convenient for us to access our data anytime and anywhere after moving them to the cloud, but we must remain vigilant about the security and privacy of our data, especially sensitive data. It is better to encrypt sensitive data prior to uploading them to the cloud storage. Thus, even if
the cloud storage is broken, the privacy of our data will not be leaked. One shortcoming
of encrypting data as a whole is that it severely limits the flexibility of users to share
their encrypted data at a fine-grained dimension. Assuming a user wants to grant access
permission of all documents of a certain project to a project member, he either needs to
act as an intermediary and decrypt all relevant files for this member or must give this
member his secret decryption key. Neither of these options is particularly attractive.
It is especially difficult when the user wants to share different documents with different people.
Sahai and Waters [1] first proposed the concept of Attribute-Based Encryption
(ABE) to address this issue. ABE is a cryptographic primitive for fine-grained data
access control in one-to-many communication. In traditional Identity-Based Encryption
(IBE) [2], the ciphertext is computed according to the targeted user’s identity, and only
that user himself can decrypt the ciphertext. It is one-to-one communication. As a
generalization of IBE, ABE introduces an innovative idea of access structure in public
key cryptosystem, making the user’s secret key or ciphertext generated based on an
access structure. Only the user who meets the specified conditions can decrypt the
ciphertext.
Nevertheless, ABE has a major shortcoming, which is called the key escrow problem. We classify the problem into two types: (1) Type 1: the key generation center (KGC) can generate a user's secret key for arbitrary access structures or sets of attributes; (2) Type 2: the KGC can decrypt the ciphertext directly using its master key. These are potential threats to data confidentiality and privacy in cloud storage, and they hinder the extensive application of ABE in the cloud.
Why do we need to solve the key escrow problem? Isn’t KGC trusted? Let’s give
an example with public key infrastructure (PKI). A PKI is an arrangement that binds
public keys with respective users' identities by means of a certificate authority (CA). One point to note is that the PKI does not know users' secret keys, even though the PKI is trusted.
However, users’ secret keys are generated by KGC in ABE. Even if KGC is trusted, we
still don’t want it to decrypt our encrypted data.
Through our research, we draw an informal conclusion that an ABE scheme inherently has the key escrow problem if there is only one authority (KGC) in the scheme, since the secret key of a user is generated by the KGC and there is no user-specific information in the ciphertext; otherwise, it would be contrary to the goal of ABE, which is designed for fine-grained data sharing. Therefore, we focus on how the cooperation between two authorities can solve the key escrow problem.
key to him/her. The drawback of this approach is that it lacks universality and is proved only in the random oracle model. Zhang et al. [12] proposed a solution to the key escrow problem by introducing another secret key x that the KGC does not know; this has some flavor of our proposed scheme. However, since the user can acquire x, if the user colludes with the KGC, the KGC can decrypt any ciphertext. Moreover, Zhang et al. only applied this idea to FIBE.
Wang et al. [13] achieved authority accountability by combining Libert and Vergnaud's IBE scheme [14] with KP-ABE [3]. As the user's secret key contains secret information that the KGC does not know, if the KGC forges secret keys in accordance with the user's identity, we can find out whether the KGC or the user is dishonest according to the key family number. However, the KGC can still decrypt the ciphertext directly using its master key.
1.4 Organization
The rest of this paper is arranged as follows. Section 2 introduces some cryptographic
background information. Section 3 describes the formal definition of CP-ABE without
key escrow (WoKE-CP-ABE) and its security model. In Sect. 4, we propose the
construction of our WoKE-CP-ABE scheme. In Sect. 5, we analyze the security of our
proposed scheme and compare our scheme with multi-authority attribute-based
encryption. In Sect. 6, we discuss some extensions. Finally, we conclude this paper.
2 Background
Definition 2.1 Access Structure [16]. Let {P_1, P_2, ..., P_n} be a set of parties. A collection A ⊆ 2^{{P_1, P_2, ..., P_n}} is monotone if ∀B, C: if B ∈ A and B ⊆ C then C ∈ A. An access structure (respectively, monotone access structure) is a collection (respectively, monotone collection) A of non-empty subsets of {P_1, P_2, ..., P_n}, i.e., A ⊆ 2^{{P_1, P_2, ..., P_n}} \ {∅}. The sets in A are called the authorized sets, and the sets not in A are called the unauthorized sets.
In our context, the role of the parties is taken by the attributes. Thus, the access structure A will contain the authorized sets of attributes. From now on, we focus on monotone access structures.
Definition 2.2 Linear Secret Sharing Schemes (LSSS) [16]. Let K be a finite field, and let Π be a secret sharing scheme with domain of secrets S ⊆ K realizing an access structure A. We say that Π is a linear secret sharing scheme over K if:
1. The piece of each party is a vector over K. That is, for every i there exists a constant d_i such that the piece of P_i is taken from K^{d_i}. We denote by Π_{i,j}(s, r) the j-th coordinate in the piece of P_i (where s ∈ S is a secret and r ∈ R is the dealer's random input).
2. For every authorized set, the reconstruction function of the secret from the pieces is linear. That is, for every G ∈ A there exist constants {α_{i,j} : P_i ∈ G, 1 ≤ j ≤ d_i} such that for every secret s ∈ S and every choice of random inputs r ∈ R,

  s = Σ_{P_i ∈ G} Σ_{1 ≤ j ≤ d_i} α_{i,j} · Π_{i,j}(s, r),

where the constants and the arithmetic are over the field K.
The total size of the pieces in the scheme is defined as d ≜ Σ_{i=1}^{n} d_i.
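To make Definition 2.2 concrete in the matrix form (M, ρ) used later in the construction, the short Python sketch below shares a secret over Z_p for the policy (A AND B) OR C and reconstructs it with the linear coefficients of an authorized set; the matrix, prime and coefficients are toy assumptions.

    # Toy LSSS over Z_p: shares are M * v with v = (s, z2); an authorized set reconstructs s linearly.
    import random

    p = 2**31 - 1                                       # toy prime (assumption)
    M = {'A': (1, 1), 'B': (0, -1), 'C': (1, 0)}        # share-generating matrix rows

    s = random.randrange(p)                             # the secret
    v = (s, random.randrange(p))                        # v = (s, z2)

    shares = {a: sum(M[a][j] * v[j] for j in range(2)) % p for a in M}

    # Authorized set {A, B}: coefficients omega_A = omega_B = 1 give M_A + M_B = (1, 0).
    recovered = (shares['A'] + shares['B']) % p
    assert recovered == s

    # {C} alone is also authorized here (M_C = (1, 0)); {B} alone is not.
    assert shares['C'] == s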
Definition 2.3 Bilinear Map. Let G_0 and G_1 be two multiplicative cyclic groups of prime order p. Let g be a generator of G_0 and e be a bilinear map, e : G_0 × G_0 → G_1. The bilinear map e has the following properties:
Bilinearity: for all u, v ∈ G_0 and a, b ∈ Z_p, we have e(u^a, v^b) = e(u, v)^{ab}.
Non-degeneracy: e(g, g) ≠ 1.
Computability: there exists an efficient algorithm for computing the bilinear map e : G_0 × G_0 → G_1.
Notice that the map e is symmetric, since e(g^a, g^b) = e(g, g)^{ab} = e(g^b, g^a).
2.3 Assumption
We state our complexity assumption below.
Definition 2.4 q-type Assumption. Initially the challenger calls the group generation algorithm with the security parameter as input, picks a random group element g ∈ G_0 and q + 2 random exponents a, s, b_1, b_2, ..., b_q ∈ Z_p. Then he sends to the adversary the group description (p, G_0, G_1, e) and all of the following terms:

  g, g^s
  g^{a^i}, g^{b_j}, g^{s b_j}, g^{a^i b_j}, g^{a^i / b_j^2}              ∀(i, j) ∈ [q, q]
  g^{a^i b_j / b_{j'}^2}                                                  ∀(i, j, j') ∈ [2q, q, q] with j ≠ j'
  g^{a^i / b_j}                                                           ∀(i, j) ∈ [2q, q] with i ≠ q + 1
  g^{s a^i b_j / b_{j'}}, g^{s a^i b_j / b_{j'}^2}                        ∀(i, j, j') ∈ [q, q, q] with j ≠ j'
3.1 Definition
A WoKE-CP-ABE scheme consists of five algorithms.
KGC-Setup(1^k) → (PK_KGC, MK_KGC). This is a randomized algorithm that takes a security parameter k ∈ N as input. It outputs the public parameters PK_KGC and the master key MK_KGC.
OAA-Setup(PK_KGC) → (PK_OAA, MK_OAA). This is a randomized algorithm that takes PK_KGC as input. It outputs the public parameters PK_OAA and the master key MK_OAA. The system's public parameters PK can be viewed as PK_KGC ∪ PK_OAA.
Key Generation. This is a key issuing protocol in which the KGC and the OAA collaboratively generate the user's secret key SK for a set of attributes S.
Encryption(PK, M, T) → CT. This is a randomized algorithm that takes as input the public parameters PK, a plaintext message M, and an access structure T. It outputs the ciphertext CT.
Decryption(CT, SK, PK) → M. This algorithm takes as input the ciphertext CT that is encrypted under an access structure T, the decryption key SK for a set of attributes S and the public parameters PK. It outputs the message M if T(S) = 1.
Challenge. The adversary A submits two equal-length messages M_0 and M_1. The challenger B flips a random coin b and encrypts M_b under the challenge access structure T*. Then B passes the ciphertext to the adversary A.
Phase 2. Phase 1 is repeated.
Guess. The adversary A outputs a guess b′ of b.
The advantage of an adversary A in this game is defined as |Pr[b′ = b] − 1/2|.
Definition 3.1. A ciphertext-policy attribute-based encryption scheme without key escrow is selectively secure if all PPT adversaries have at most negligible advantage in k in the above security game.
4 Our Construction
Nevertheless, attributes can be any meaningful unique strings by using a collision-resistant hash function H : {0, 1}* → Z_p.
Our construction follows.
KGC-Setup(1^k) → (PK_KGC, MK_KGC). The algorithm calls the group generator algorithm G(1^k) and gets the descriptions of the groups and the bilinear map D = (p, G_0, G_1, e). It then chooses random terms g, u, h, w, v ∈ G_0 and α′ ∈ Z_p. The published public parameters PK_KGC are

  (D, g, u, h, w, v, e(g, g)^{α′}).

The master key MK_KGC is α′.
OAA-Setup(PK_KGC) → (PK_OAA, MK_OAA). Choose μ uniformly at random in Z_p. We can view α′ as α/μ. The published public parameter PK_OAA is

  (e(g, g)^{α′})^μ = e(g, g)^α.
Key Generation. The KGC and the OAA are involved in the user's key issuing protocol, in which the KGC needs to communicate with the OAA to generate the user's secret key. The key issuing protocol consists of the following steps:
1. Firstly, the KGC and the OAA independently authenticate a user U with a set of attributes S = {A_1, A_2, ..., A_k} ⊆ Z_p.
2. The KGC selects a random exponent θ ∈ Z_p and sends it to U. θ is used to prevent the OAA from obtaining U's complete secret key.
3. The KGC picks r′, r′_1, r′_2, ..., r′_k ∈ Z_p at random and computes

  (S, K′_0 = g^{α′/θ} w^{r′/θ}, w^{1/θ}, K′_1 = g^{r′}, {K′_{i,2} = g^{r′_i}, K′_{i,3} = (u^{A_i} h)^{r′_i} v^{−r′}}_{i∈[k]})

and sends it to the OAA. The OAA then picks r″, r″_1, ..., r″_k ∈ Z_p at random and computes

  (S, K″_0 = (K′_0)^μ (w^{1/θ})^{r″}, K″_1 = (K′_1)^μ g^{r″}, {K″_{i,2} = (K′_{i,2})^μ g^{r″_i}, K″_{i,3} = (K′_{i,3})^μ (u^{A_i} h)^{r″_i} v^{−r″}}_{i∈[k]}).

The resulting secret key of the user is

  SK = (S, θ, K_0 = K″_0 = g^{α′μ/θ} w^{(r′μ + r″)/θ} = g^{α/θ} w^{r/θ}, K_1 = K″_1 = g^{r′μ + r″} = g^r, {K_{i,2} = K″_{i,2}, K_{i,3} = K″_{i,3}}_{i∈[k]}).

We implicitly set r = r′μ + r″ and {r_i = r′_i μ + r″_i}_{i∈[k]}.
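The exponent algebra behind this two-authority issuing of K_0 can be checked with a small Python sketch. We work in a plain prime-order subgroup instead of a pairing group, with toy parameters; θ, μ and the group choices are assumptions used only to verify that K_0 = g^{α/θ} w^{r/θ} with α = α′μ and r = r′μ + r″.

    # Exponent-level sketch of the two-authority key issuing for K_0 (toy parameters).
    import random

    p, q = 2039, 1019            # p = 2q + 1, subgroup of prime order q (toy sizes)
    g, w = 4, 9                  # generators of the order-q subgroup

    alpha_p = random.randrange(1, q)         # KGC's master key alpha'
    mu      = random.randrange(1, q)         # OAA's master key mu
    theta   = random.randrange(1, q)         # user-specific exponent sent to U
    r1, r2  = random.randrange(1, q), random.randrange(1, q)   # r', r''

    inv_theta = pow(theta, -1, q)

    # KGC: K0' = g^{alpha'/theta} w^{r'/theta}, plus the helper w^{1/theta}
    K0p   = pow(g, alpha_p * inv_theta % q, p) * pow(w, r1 * inv_theta % q, p) % p
    w_inv = pow(w, inv_theta, p)

    # OAA: K0'' = (K0')^mu * (w^{1/theta})^{r''}
    K0pp = pow(K0p, mu, p) * pow(w_inv, r2, p) % p

    # The user's K_0 equals g^{alpha/theta} w^{r/theta} with alpha = alpha'*mu, r = r'*mu + r''
    alpha, r = alpha_p * mu % q, (r1 * mu + r2) % q
    assert K0pp == pow(g, alpha * inv_theta % q, p) * pow(w, r * inv_theta % q, p) % p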
Encryption(PK, m, (M, ρ)) → CT. To encrypt a message m ∈ G_1 under an access structure encoded in an LSSS policy (M, ρ), let the dimensions of M be l × n. Each row of M is labeled by an attribute, and ρ(i) denotes the label of the i-th row M_i. Choose a random vector ~z = (s, z_2, ..., z_n)^T from Z_p^n, where s is the random secret to be shared among the shares.
Decryption(CT, SK, PK) → m. To decrypt the ciphertext CT with the decryption key SK, proceed as follows. Suppose that S satisfies the access structure and let I = {i : ρ(i) ∈ S}. Since the set of attributes satisfies the access structure, there exist coefficients ω_i ∈ Z_p such that Σ_{ρ(i)∈I} ω_i M_i = (1, 0, ..., 0); then Σ_{ρ(i)∈I} ω_i λ_i = s. Now it calculates

  m·e(g, g)^{αs} · ∏_{i∈I} (e(C_{i,1}, K_1) · e(C_{i,2}, K_{i,2}) · e(C_{i,3}, K_{i,3}))^{ω_i} / (e(C_0, K_0))^θ
  = m·e(g, g)^{αs} · ∏_{i∈I} (e(w^{λ_i} v^{t_i}, g^r) · e((u^{ρ(i)} h)^{−t_i}, g^{r_i}) · e(g^{t_i}, (u^{A_i} h)^{r_i} v^{−r}))^{ω_i} / (e(g^s, g^{α/θ} w^{r/θ}))^θ
  = m.
Type-II adversary. A Type-II adversary is defined as a curious KGC that is restricted so that it cannot collude with any user or with the OAA. This adversary needs to recover μ or s to decrypt the ciphertext m·e(g, g)^{αs}:

  (e(g^{α′}, g^s))^μ → e(g, g)^{αs} → m, by using μ.

Since computing the discrete logarithm is believed to be difficult, our scheme can resist the attack from the Type-III adversary. □
Type-IV adversary. A Type-IV adversary is defined as dishonest users colluding with the KGC. Although this adversary can request some users' secret keys, it cannot obtain more information about μ than the Type-II adversary, since the users' secret keys are randomized by the OAA. This adversary also needs to recover μ or s to decrypt the ciphertext. Since computing the discrete logarithm is believed to be difficult, our scheme can resist the attack from the Type-IV adversary. □
Type-V adversary. A Type-V adversary is defined as dishonest users colluding with the OAA. Obviously, this adversary has less power than the Type-IV adversary. The adversary needs to recover α′, s or r′ to decrypt the ciphertext:

  (g^{α′/θ} w^{r′/θ})^{θμ} → g^{α} w^{r′μ}.

If this adversary knew r′, it could compute keys for any set of attributes by using θ obtained from dishonest users; but recovering r′ amounts to solving a discrete logarithm problem. So our scheme can resist the attack from the Type-V adversary. □
6 Extensions
6.1 Universality
Our method for removing key escrow is applicable to other ABE schemes. It is easy to transform the single authority of an ABE scheme into a KGC and an OAA. At the setup stage, the KGC performs the same Setup as in the original scheme, and the OAA performs an exponentiation on the α-related part e(g, g)^α.
Now we will mainly focus on key generation. We analyze universality for CP-ABE and KP-ABE respectively.
For CP-ABE, we describe the transformation method by analyzing our proposed scheme. The main difference between the secret key of our scheme and that of Rouselakis and Waters [6] is the generation of K_0. In our scheme,
  K′_0 = g^{α′/θ} w^{r′/θ}, w^{1/θ}, θ  →  (K′_0)^μ (w^{1/θ})^{r″} = g^{α′μ/θ} w^{(r′μ + r″)/θ} = g^{α/θ} w^{r/θ}, θ.

In Rouselakis and Waters [6], K_0 = g^α w^r. When generating keys, we use two more parts, w^{1/θ} and θ. We call θ the key-escrow-free part and w^{1/θ} the affiliated part for the randomization of the secret key. Notice that the OAA also needs to randomize the secret key; the reason has been analyzed in Sect. 4. By using these two parts, we are already able to construct a scheme without key escrow.
For KP-ABE, the situation is a little different from CP-ABE. Let us look at an example. For the KP-ABE scheme of Rouselakis and Waters [6], the KGC also needs to generate a key-escrow-free part θ and send it to the user. The KGC then uses the Key Generation algorithm of [6], the only difference being that α is replaced with α′/θ, and sends the result to the OAA. The OAA performs an exponentiation with μ and a randomization operation similar to ours, and then sends the result to the user. From this example we can see that we only operate on the exponent; the format of the user's secret key in the original scheme does not matter.
The Encryption algorithm is identical, and θ is used in the Decryption algorithm. As this is apparent, we will not analyze it further.
7 Conclusion
Key escrow is quite a challenging issue in ABE. We formalize the concept of ciphertext
policy attribute-based encryption without key escrow (WoKE-CP-ABE) and propose a
scheme for solving the key escrow problem. In our construction, we use two author-
ities, KGC and OAA (outsourced attribute authority) which communicate with each
other to issue secret keys for users. Unless KGC colludes with OAA, neither KGC nor
OAA can decrypt the ciphertext independently. Our scheme is proved to be selectively
secure in the standard model. We give universal methods for transforming both
KP-ABE and CP-ABE with a single authority to solve the problem of key escrow. In
addition, our scheme naturally supports outsourcing the decryption of ciphertexts. Since the KGC's behavior is more restricted than in ABE with a single authority, our scheme will encourage people to store more sensitive data in the cloud and promote the application of ABE in a wider range.
Acknowledgments. This work is supported by the National High Technology Research and Development Program ("863" Program) of China under Grant No. 2015AA016009, the National Natural Science Foundation of China under Grant No. 61232005, and the Science and Technology Program of Shenzhen, China under Grant No. JSGG20140516162852628.
References
1. Sahai, A., Waters, B.: Fuzzy identity-based encryption. In: Cramer, R. (ed.) EUROCRYPT
2005. LNCS, vol. 3494, pp. 457–473. Springer, Heidelberg (2005)
2. Boneh, D., Franklin, M.: Identity-based encryption from the Weil pairing. In: Kilian, J. (ed.)
CRYPTO 2001. LNCS, vol. 2139, pp. 213–229. Springer, Heidelberg (2001)
3. Goyal, V., Pandey, O., Sahai, A., Waters, B.: Attribute-based encryption for fine-grained
access control of encrypted data. In: ACM Conference on Computer and Communications
Security, pp. 89–98 (2006)
4. Ostrovsky, R., Sahai, A., Waters, B.: Attribute-based encryption with non-monotonic access
structures. In: ACM Conference on Computer and Communications Security, pp. 195–203
(2007)
5. Attrapadung, N., Libert, B., de Panafieu, E.: Expressive key-policy attribute-based
encryption with constant-size ciphertexts. In: Catalano, D., Fazio, N., Gennaro, R.,
Nicolosi, A. (eds.) PKC 2011. LNCS, vol. 6571, pp. 90–108. Springer, Heidelberg (2011)
6. Rouselakis, Y., Waters, B.: Practical constructions and new proof methods for large universe
attribute-based encryption. In: ACM Conference on Computer and Communications
Security, pp. 463–474 (2013)
7. Bethencourt, J., Sahai, A., Waters, B.: Ciphertext-policy attribute-based encryption. In:
IEEE Symposium on Security and Privacy, pp. 321–334 (2007)
8. Cheung, L., Newport, C.: Provably secure ciphertext policy ABE. In: ACM Conference on
Computer and Communications Security, pp. 456–465 (2007)
9. Waters, B.: Ciphertext-policy attribute-based encryption: an expressive, efficient, and
provably secure realization. In: Catalano, D., Fazio, N., Gennaro, R., Nicolosi, A. (eds.)
PKC 2011. LNCS, vol. 6571, pp. 53–70. Springer, Heidelberg (2011)
10. Hur, J., Koo, D., Hwang, S.O., Kang, K.: Removing escrow from ciphertext policy
attribute-based encryption. Comput. Math Appl. 65(9), 1310–1317 (2013)
11. Hur, J.: Improving security and efficiency in attribute-based data sharing. IEEE Trans.
Knowl. Data Eng. 25(10), 2271–2282 (2013)
12. Zhang, G., Liu, L., Liu, Y.: An attribute-based encryption scheme secure against malicious
KGC. In: IEEE 11th International Conference on Trust, Security and Privacy in Computing
and Communications (TrustCom), pp. 1376–1380 (2012)
13. Wang, Y., Chen, K., Long, Y., Liu, Z.: Accountable authority key policy attribute-based
encryption. Sci. China Inf. Sci. 55(7), 1631–1638 (2012)
Attribute-Based Encryption Without Key Escrow 87
14. Libert, B., Vergnaud, D.: Towards black-box accountable authority IBE with short
ciphertexts and private keys. In: Jarecki, S., Tsudik, G. (eds.) PKC 2009. LNCS, vol. 5443,
pp. 235–255. Springer, Heidelberg (2009)
15. Green, M., Hohenberger, S., Waters, B.: Outsourcing the decryption of ABE ciphertexts. In:
USENIX Security Symposium (2011)
16. Beimel, A.: Secure schemes for secret sharing and key distribution. PhD thesis, Israel
Institute of Technology, Technion, Haifa, Israel (1996)
17. Chase, M.: Multi-authority attribute based encryption. In: Vadhan, S.P. (ed.) TCC 2007.
LNCS, vol. 4392, pp. 515–534. Springer, Heidelberg (2007)
18. Chase, M., Chow, S.S.: Improving privacy and security in multi-authority attribute-based
encryption. In: ACM Conference on Computer and Communications Security, pp. 121–130
(2009)
19. Lewko, A., Waters, B.: Decentralizing attribute-based encryption. In: Paterson, K.G. (ed.)
EUROCRYPT 2011. LNCS, vol. 6632, pp. 568–588. Springer, Heidelberg (2011)
20. Liu, Z., Cao, Z., Huang, Q., Wong, D.S., Yuen, T.H.: Fully secure multi-authority
ciphertext-policy attribute-based encryption without random oracles. In: Atluri, V., Diaz, C.
(eds.) ESORICS 2011. LNCS, vol. 6879, pp. 278–297. Springer, Heidelberg (2011)
21. Wang, G., Liu, Q., Wu, J.: Hierarchical attribute-based encryption for fine-grained access
control in cloud storage services. In: Proceedings of the 17th ACM Conference on Computer
and Communications Security, pp. 735–737 (2010)
An Efficient Access Control Optimizing
Technique Based on Local Agency
in Cryptographic Cloud Storage
1 Introduction
Data security is the primary factor in the usage and development of cloud storage [1, 2]. Shi Xiaohong, the vice president of 360 Inc., has said that data security is the core issue of cloud storage [3]. From the user's perspective, the ideal state of cloud storage security is data protection that remains under the user's control. Most users, especially enterprise users, hope to obtain the same protection for cloud storage as for local storage while their data is stored, transmitted, shared, used, and destroyed in the cloud [4]. Once a data leak occurs, it may cause incalculable damage, because user data may contain business decisions, core technology, trade secrets, personal privacy, and other sensitive content. Therefore, the security focus for user data in cloud storage is protection against data leakage. Cloud storage is open and shared by nature, so user data is vulnerable to attacks and threats from networks. In addition, cloud service providers (CSPs) and their staff may also leak user data, whether actively or passively.
The most direct way to prevent leakage of cloud data is for the user to encrypt the data and then distribute the decryption key to control access authority. However, this is not efficient and greatly increases the time cost of using cloud storage. A main idea for data confidentiality protection is to establish a cryptographic cloud storage [5] mechanism that protects shared data in cloud storage with access control policies managed by the user. In this way, even if illegal users obtain the shared data, they cannot recover the plaintext.
With the goal of user self-management of important data, many researchers have proposed data protection schemes based on ciphertext access control: the data owner encrypts the data with a symmetric algorithm, publishes both the data and the access policies to cloud storage, and then distributes decryption keys through a secure channel. Data users first fetch the ciphertext from cloud storage, decrypt it with their own private keys, and obtain the decryption key. This method imposes a large computation burden, and the efficiency of key distribution is low.
With the development of attribute-based encryption, researchers have proposed ciphertext access control techniques based on Ciphertext-Policy Attribute-Based Encryption (CP-ABE) [6, 7]. Since the access structure of CP-ABE is embedded in the ciphertext, data owners have more initiative and may decide the access structure on their own. Chase resolved the bottleneck of attribute management and key distribution being handled by a single Attribute Authority (AA) [8] by allowing multiple independent AAs to manage attributes and distribute keys. Waters presented the first fully expressive access structure for CP-ABE [9].
However, using CP-ABE to protect data security in cloud storage still faces many problems. One is the large time cost of encryption, decryption, and revocation in CP-ABE; the other is that the access granularity of the CP-ABE key distribution strategy does not match the needs of most cloud storage scenarios. In this paper, we focus on access efficiency and fine-grained access. We propose a ciphertext access control scheme based on a hierarchical CP-ABE encryption algorithm, which supports fine-grained access control, and we use local proxy technology to optimize the efficiency of cloud storage access.
encryption. In a CP-ABE scheme, users are described by attribute sets and the data owner specifies an access tree structure; the decryption key can be recovered, and the plaintext obtained, if and only if the user's attribute set satisfies the access tree.
Cloud storage users can generally be divided into three types: individual users, enterprise users, and community users. An individual user is a single person who uses the cloud storage service independently, typically in the form of a network disk service. Individual users have little data and few data sharing requirements, and can implement data security protection with simple ciphertext access control mechanisms. Enterprise users belong to a particular enterprise or group; their attributes are generally associated with a hierarchical structure, and they usually have large data sharing requirements over large amounts of data. Community users are users who have a certain relationship with each other but are geographically dispersed, and who need to share data. Consider the following scenario: teachers of ten universities in one city use the same cloud storage service, and the staff of the Information Institute of university A and the Information College of university B develop a research project named X. Its access control list is shown in Table 1.
Therefore, we need to design an access control scheme that meets the demands below:
1. Flexible access control policy. The access control tree should support access control by precise identity or by simple attributes.
2. Access control structures can be represented as a hierarchical structure, such as an enterprise administrative level or departmental structure.
3. The efficiency of key distribution should be compatible with the encryption and decryption time costs that users commonly accept in cloud storage.
Based on the HIBE system, we add precise identity attributes to the access structure of the CP-ABE system, and we introduce domain management into the access control structure of CP-ABE to generate keys and use hierarchical key distribution, as described below.
(Figure: tree-structured key distribution model, with an RM root performing Setup, UM and DM-Level 1 nodes below it, and users at the leaves.)
This model is a tree structure consisting of a root master (RM) node and a number of domains. The RM generates and distributes system parameters and domain keys, the user master (UM) manages users, and the domain master (DM) generates keys and attributes
for the lower levels of the hierarchy. The RM, UM, and DM nodes can be implemented by a Local Proxy (LP) or a trusted third party (TTP). Each DM and each attribute has a unique ID, and each user has a unique ID identifier and a series of attribute flags. The ciphertext access structure comprises an ID set of data users (DU) and an attribute-based access control tree T. When the data owner (DO) releases an access strategy, the DM first judges whether the ID of the DU is in its precise ID set; if so, the DM authorizes decryption of the ciphertext without any attribute judgment; otherwise, the DM analyzes whether the attribute set in the DU's key satisfies the access control policy T. The access control tree T represents an access control structure. Each non-leaf node x of T represents a k_x-of-num_x threshold operation: num_x is the number of child nodes and k_x is the threshold, with 0 < k_x ≤ num_x. k_x = 1 corresponds to an OR operation and k_x = num_x to an AND operation. Each leaf node x represents an attribute att(x). An access control tree can describe an attribute-based access control policy; to judge whether an attribute set S satisfies an access control tree T, proceed as follows (a code sketch is given after the list):
1. Let r be the root node of T and denote T by T_r. Let T_x be the subtree rooted at node x; if the attribute set S satisfies T_x, we write T_x(S) = 1.
2. Let x = r and compute T_x(S). For a non-leaf node x with children x_{c1}, …, x_{c num_x}, compute T_{x_{ci}}(S) for i ∈ [1, num_x]; T_x(S) returns 1 if and only if at least k_x of the children return 1. For a leaf node, T_x(S) returns 1 if and only if att(x) ∈ S.
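A minimal sketch of this recursive check follows; the node representation and the example attribute names are ours for illustration, not the paper's notation.

```python
# Recursive satisfaction test T_x(S) for a k_x-of-num_x threshold access tree.
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class Node:
    k: int = 1                                   # threshold k_x (1 = OR, num_x = AND)
    children: List["Node"] = field(default_factory=list)
    att: Optional[str] = None                    # attribute att(x) if this is a leaf

def satisfies(x: Node, S: Set[str]) -> bool:
    """Return True iff T_x(S) = 1."""
    if x.att is not None:                        # leaf: att(x) must belong to S
        return x.att in S
    ok = sum(satisfies(c, S) for c in x.children)
    return ok >= x.k                             # at least k_x children must be satisfied

# Example policy: ("InfoInstitute" AND "ProjectX") OR "Dean"
T = Node(k=1, children=[
    Node(k=2, children=[Node(att="InfoInstitute"), Node(att="ProjectX")]),
    Node(att="Dean"),
])
print(satisfies(T, {"InfoInstitute", "ProjectX"}))   # True
print(satisfies(T, {"InfoInstitute"}))               # False
```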
• Encrypt: Let DO be the data owner and T_do the access control tree of DO. Let DM_i, located on layer i, manage the attribute-based access control tree T_i. To encrypt K, the DO outputs the ciphertext CK = (…, T_do, CF), whose input parameters are the precise ID set {ID_u1, …, ID_um}, the attribute access control tree T_do, all the user public keys in that set, and all the attribute keys in T_i. The parameters are computed as:
P_ui = H1(PK_ui); U_ui = r·P_ui; V = K ⊕ H2(ê(Q0, r·nA·Pu)); CF = [u0, U_u1, …, U_um, V].
• Decrypt: Given the ciphertext CK, if the ID of user U belongs to the precise ID set, then the key K can be recovered using the system parameters params and the user's private key SK_u. Given the ciphertext CK, if the user's attributes satisfy the access structure T, which means U holds at least one attribute key in the access control tree T, then the plaintext can be recovered using the identity key SK_iu, the system parameters params, and the user's attribute secret keys {SK_iu,a | a ∈ T}.
The security of this scheme depends on the security of the CP-ABE and HIBE algorithms; due to space limitations, we do not prove it in this paper.
2. Encryption and decryption time cost. The encryption time cost consists of the symmetric encryption of the data and the encryption of the key with the CP-ABE algorithm. Since CP-ABE is an asymmetric algorithm with low encryption efficiency, the time cost of the symmetric algorithm is negligible by comparison. The decryption time cost consists of decrypting the key ciphertext CK and decrypting the data ciphertext; in practice the decryption time cost is much smaller than that of encryption.
3. Key distribution time cost. This is the time needed for the UM and DM to publish all users' private keys, identity keys, and attribute keys. This CP-ABE based key distribution needs to transform the access control policy into an access control tree T. In general, the policy conversion is a simple pretreatment whose time cost is relatively small.
4. Right revocation time cost. Revocation is a re-encryption operation on the file F and the key k. Its process is as follows: the data owner retrieves the affected ciphertexts Ek(F) and CK and decrypts them, then uses a new key k' to re-encrypt the file F and obtain the ciphertext Ek'(F). After that, a new access control structure T' is built and k' is re-encrypted under it. The new ciphertexts Ek'(F) and CK' are uploaded to cloud storage, and the outdated ciphertext data is deleted (a rough code sketch of this flow is given after this list). The main time cost in the revocation operation is the data re-encryption.
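The following is a hypothetical sketch of the revocation flow just described, not the paper's algorithm: it assumes the encrypted file is stored as nonce, tag, and ciphertext from AES in EAX mode, and all callables (cloud_get/cloud_put/cloud_delete for the cloud storage interface, cpabe_decrypt/cpabe_encrypt for the hierarchical CP-ABE layer) are assumed interfaces, not real library functions.

```python
# Hypothetical revocation-by-re-encryption flow (all names and interfaces illustrative).
import os
from Crypto.Cipher import AES   # pip install pycryptodome

def revoke(record, new_policy, cloud_get, cloud_put, cloud_delete,
           cpabe_decrypt, cpabe_encrypt):
    # 1. Retrieve the affected ciphertexts E_k(F) and CK, and decrypt them.
    blob = cloud_get(record["data_url"])
    nonce, tag, ct = blob[:16], blob[16:32], blob[32:]
    k = cpabe_decrypt(cloud_get(record["key_url"]))
    f_plain = AES.new(k, AES.MODE_EAX, nonce=nonce).decrypt_and_verify(ct, tag)

    # 2. Re-encrypt F under a fresh key k' and protect k' with the new access structure T'.
    k_new = os.urandom(16)                       # AES-128 key
    cipher = AES.new(k_new, AES.MODE_EAX)
    ct_new, tag_new = cipher.encrypt_and_digest(f_plain)
    ck_new = cpabe_encrypt(k_new, new_policy)

    # 3. Upload the new ciphertexts and delete the outdated data.
    data_url = cloud_put(cipher.nonce + tag_new + ct_new)
    key_url = cloud_put(ck_new)
    cloud_delete(record["data_url"])
    cloud_delete(record["key_url"])
    return {**record, "data_url": data_url, "key_url": key_url, "policy": new_policy}
```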
System initialization and key generation have the largest time costs because they involve bilinear and exponentiation operations. Using a local proxy can greatly improve the efficiency of cloud storage access; the advantages are as follows:
• Full use of the existing computing and storage resources of the enterprise or community. The local proxy can be built on existing equipment, which saves costs and avoids idle equipment.
• Protection of data sharing security. The local proxy can be considered fully trusted and can enforce mandatory access control policies or implement ciphertext access control to protect shared data.
• Improved cloud storage access efficiency. The local proxy can perform the system initialization and key generation operations that dominate the time cost of ciphertext access control, cache frequently used data, reduce the frequency of cloud storage accesses, and thus improve access efficiency.
• Protection of sensitive data. Corporate data involving sensitive content must be encrypted before being uploaded to the cloud; furthermore, part of the data can be kept in local storage to avoid disclosing critical data to cloud storage.
To sum up, using a local proxy is an effective way to optimize the efficiency of cloud storage access.
geographically dispersed. We assume that each local proxy performs write operations only in its own exclusive space, but its data can be read by other proxies. The main functions of the local proxy are designed as follows:
• Cloud storage access. Users interact with the local proxy, which performs encryption and decryption on their behalf and interacts with the cloud storage service; the local proxy is also responsible for uploading encrypted files and downloading data, so users no longer access cloud storage directly.
• Ciphertext access control. In our scheme, the initialization operation of the RM, the user management of the UM, and the key generation and distribution of the DM can all be implemented by the local proxy, and the local proxy can enforce the ciphertext access control policy based on hierarchical CP-ABE. This effectively reduces the impact of its inefficiency.
• Local data cache. The local proxy can set up a buffer that caches frequently accessed data and reduces the frequency of cloud storage accesses.
The local proxy is mainly composed of three parts: the storage interface, data processing services, and data storage services; Fig. 3 shows its basic structure below.
In Fig. 3, the object storage interface provided by the LP for data access conforms to the cloud data management interface CDMI [12] and supports PUT, GET, DELETE, FTP, NFS, and other operations. In addition, we design a permission configuration interface so that an access strategy can be specified for data before uploading.
(Fig. 3: basic structure of the local proxy, showing metadata management, local storage, and the cloud storage services.)
The main function of the Data Processing Services is to interact with the user: it receives user data, passes it to the lower Data Storage Services, and publishes data to the cloud. It also receives the user's instructions, obtains the response data from the Data Storage Services, and returns it to the user. The Data Processing Services include Metadata Management, Uniform Resource Management (URM), Data Encryption and Decryption, Ciphertext Access Control Management, and other functions.
The LP generates metadata for each user data object, including the data object identifier, the data access policy, the access time, the state of the data object, the data location, the key location, the ciphertext location, and so on; it also contains an LP identifier used to distinguish data released by different proxies. The URM includes a resource scheduling module and a replica management module, and is mainly responsible for the application and dispatch of resources. Data Encryption and Decryption is mainly used for data confidentiality protection; we use the AES-128 algorithm in this scheme. Ciphertext Access Control Management completes the main work of access control based on hierarchical CP-ABE, including system initialization, user management, key generation, key distribution, and other operations; it takes on most of the overhead of using the cloud storage service and improves access efficiency.
The Data Storage Services are responsible for managing the local data cache and the data stored in the cloud. The local cache uses the LP's own storage to hold frequently accessed data objects, following a most-recently-used priority policy (a small cache sketch is given below). Cloud Storage Management is mainly responsible for data kept in the cloud storage service, including uploading/downloading data and optimizing the data layout to reduce user costs.
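A minimal sketch of such a cache follows, reading the "most recently used priority" policy as keeping the most recently used objects and evicting the least recently used one when full; the capacity value and class name are illustrative.

```python
from collections import OrderedDict

class LocalCache:
    """Sketch of the LP data cache: recently used objects stay, and the least recently
    used entry is evicted once the (illustrative) capacity is exceeded."""
    def __init__(self, capacity=128):
        self.capacity = capacity
        self.store = OrderedDict()

    def get(self, key):
        if key not in self.store:
            return None                       # cache miss: caller fetches from cloud
        self.store.move_to_end(key)           # mark as most recently used
        return self.store[key]

    def put(self, key, value):
        self.store[key] = value
        self.store.move_to_end(key)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)    # evict the least recently used object
```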
Let the local proxies be LP1, LP2, …, LPm, where m is the total number of local proxies, and let LPj.urls denote the data published by LPj. Taking data publishing as an example, the data processing performed by a local proxy is shown as Algorithm 1 below.
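As a rough illustration of this publishing flow (not a transcription of Algorithm 1), the sketch below encrypts the file with AES-128, wraps the content key under the access policy, builds the metadata record, and uploads both parts. Every function and field name here is hypothetical; cpabe_encrypt and cloud_put stand in for the hierarchical CP-ABE layer and the cloud storage interface.

```python
# Hypothetical LP data-publishing flow (names and interfaces illustrative).
import os, uuid
from datetime import datetime, timezone
from Crypto.Cipher import AES   # pip install pycryptodome

def publish(lp_id, path, policy, cpabe_encrypt, cloud_put):
    # Encrypt the file with a fresh AES-128 content key.
    k = os.urandom(16)
    cipher = AES.new(k, AES.MODE_EAX)
    with open(path, "rb") as f:
        ct, tag = cipher.encrypt_and_digest(f.read())
    # Protect the content key under the access control policy (hierarchical CP-ABE).
    ck = cpabe_encrypt(k, policy)
    # Upload the ciphertexts and build the metadata record kept by the LP.
    meta = {
        "object_id": str(uuid.uuid4()),
        "lp_id": lp_id,                                   # distinguishes the releasing proxy
        "policy": policy,
        "created": datetime.now(timezone.utc).isoformat(),
        "data_url": cloud_put(cipher.nonce + tag + ct),   # nonce, tag, ciphertext
        "key_url": cloud_put(ck),
    }
    return meta
```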
5 Conclusion
In this paper we proposed a solution for protecting data privacy when using cloud storage services, from the user's point of view. On the one hand, we proposed a ciphertext access control scheme based on hierarchical CP-ABE, which adds a hierarchical structure to access right distribution and precise identity information to attribute distribution. Using both precise identities and attributes at the same time efficiently implements fine-grained access control and supports key generation in a hierarchical structure, making the scheme better suited to actual cloud storage scenarios. On the other hand, we proposed an efficient access control
optimizing technique based on local agency, which implements data privacy protection. We presented the basic structure of the local agent and its data processing. Experiments show that the scheme can effectively reduce the overhead of the ciphertext access control mechanism and improve the access efficiency of cloud storage. Further research will address the integrity protection of ciphertext, the replication policy, access right revocation, and other optimization techniques that affect the efficiency of cloud storage access.
References
1. iResearch: China Cloud Storage Industry and User Behavior Research Report. http://report.
iresearch.cn/1763.html
2. Borgmann, M., Hahn, T., Herfert, M.: On the security of cloud storage services. http://www.
sit.fraunhofer.de/content/dam/sit/en/studies/Cloud-Storage-ecurity_a4.pdf
3. Shi, X.: The Core of Cloud Storage is Information Security, pp. 107–108. China Information
Security, Beijing (2013)
4. Mather, T., Kumaraswamy, S., Latif, S.: Cloud Security and Privacy. O’Reilly Media, Inc.,
Houston (2009)
5. Kamara, S., Lauter, K.: Cryptographic cloud storage. In: Sion, R., Curtmola, R., Dietrich, S.,
Kiayias, A., Miret, J.M., Sako, K., Sebé, F. (eds.) RLCPS, WECSR, and WLC 2010. LNCS,
vol. 6054, pp. 136–149. Springer, Heidelberg (2010)
6. Zhang, R., Chen, P.: A dynamic cryptographic access control scheme in cloud storage
services. In: Proceedings of the 8th International Conference on Computing and Networking
Technology, pp. 50–55. IEEE Press, New York (2012)
7. Lv, Z., Zhang, M., Feng, D.: Cryptographic access control scheme for cloud storage. Jisuanji
Kexue yu Tansuo, pp. 835–844. Computer Research and Development, Beijing (2011)
8. Chase, M.: Multi-authority attribute based encryption. In: Vadhan, S.P. (ed.) TCC 2007.
LNCS, vol. 4392, pp. 515–534. Springer, Heidelberg (2007)
9. Waters, B.: Ciphertext-policy attribute-based encryption: an expressive, efficient and
provably secure realization. In: Proceedings of the 14th International Conference on Practice
and Theory in Public Key Cryptography, pp. 53–70. Taormina, Italy (2011)
10. Goyal, V., Pandey, O., Sahai, A.: Attribute-based encryption for fine-grained access control
of encrypted data. In: Proceedings of the 13th ACM Conference on Computer and
Communications Security, pp. 89–98. ACM Press, New York (2006)
11. Yu, S., Wang, C., Ren, K.: Achieving secure, scalable and fine-grained data access control in
cloud computing. In: Proceedings of the IEEE INFOCOM 2010, pp. 19. IEEE Press, New
York (2010)
12. SNIA: Cloud Data Management Interface (CDMI). http://snia.org/sites/default/files/CDMI%
20v1.0.2.pdf
13. Bethencourt, J., Sahai, A., Waters, B.: Advanced crypto software collection ciphertext–
policy attribute–based encryption. http://acsc.cs.utexas.edu/cpabe/
14. Plank, J.S., Simmerman, S., Schuman, C.D.: Jerasure: A library in C/C++ facilitating erasure
coding for storage applications – version 1.2. http://web.eecs.utk.edu/*plank/plank/papers/
CS-08-627.html
Complete Separable Reversible
Data Hiding in Encrypted Image
1 Introduction
Data hiding refers to technology that is used to embed additional data into multimedia
and can be divided into non-reversible [1, 2] and reversible categories [3–10]. Rev-
ersible data hiding can be achieved mainly based on lossless compression [3], integer
transform [4], difference expansion (DE) [5] and histogram shifting (HS) [6–8]. All of
these methods have good embedding efficiency for plaintext images and can also be
applied to JPEG images [9, 10].
As a typical SPED (signal processing in the encrypted domain [11]) topic, RDHEI
means embedding additional data into encrypted images, and has the reversibility
feature of being able to extract the additional data and recover the original image. Since
there is good potential for practical applications including encrypted image authenti-
cation, content owner identification, and privacy protection, RDHEI has attracted more
and more attention from many researchers [12–20].
In [15], an image is encrypted by a stream cipher and the data hider can embed
additional data by flipping the 3 LSB (least significant bits) of pixels. Hong et al. [16]
improve on this with side block matching and smoothness sorting. This year, Liao and
Shu proposed an improved method [17] based on [15, 16]. A new, more precise
function was presented to estimate the complexity of each image block and increase the
correctness of data extraction/image recovery. However, in all of the methods men-
tioned above [15–17], data can only be extracted after image decryption. To overcome
this problem, a separable RDHEI is proposed [18]. A legal receiver can choose 3
different options depending on the different keys held: extracting only the embedded
data with the data hiding key, decrypting an image very similar to the original with the
content owner key, or extracting both the embedded data and recovering the original
image with both of the keys. Recently, another separable method based on pixel
prediction was proposed in [19]. In the data hiding phase, a number of individual pixels
are selected using a pseudo-random key, and additional bits are hidden in the two most
significant bits. However, as the payload increases, the error rate also increases. Yin
et al. [20] offer high payload and error-free data extraction by introducing
multi-granularity permutation, which does not change the image histogram. However,
leakage of the image histogram is inevitable under exhaustive attack. Moreover, in all
of the methods discussed above [18–20], the embedded data can only be extracted
before image decryption. That means that a legal receiver who has the data hiding key
and the decrypted image cannot extract the embedded data.
To solve this problem, this paper presents a new complete separable RDHEI
method based on RC4 encryption [21] and local histogram modification. Not only can
the proposed method completely satisfy the definition of “separable” [18], but the
embedded data can be extracted error-free both from marked encrypted images (cipher
domain) and directly decrypted images (plaintext domain). However, there is a tradeoff: there should be no saturated pixels of value 0 or 255 in the image. Since saturated pixels are almost non-existent in natural images, this is a small concession. Compared
with other state-of-the-art research [18, 19], the proposed method achieves higher
embedding payload, better image quality and error-free image restoration.
2 Proposed Method
elaborated in the following sections. First, we discuss image encryption and decryption
by using RC4 [21] in Sect. 2.1.
I_e = e(I, K)    (1)
The decryption:
I = d(I_e, K)    (3)
Since the value of each grayscale pixel p_i ranges from 1 to 254, Eq. (4) has a unique solution; if the solution equals 0, it is revised to 254. In this paper, we first divide the original image into non-overlapping blocks I = {B_j}_{j=1}^{l} of size u × v, where l = n/(u × v). All the pixels in each block are then encrypted with the same k_j. Thus, each encrypted block B_j^e keeps structural redundancy that can carry additional data.
The data hider then scans the non-basic pixels {{q_{j,k}}_{k=1}^{u·v-2}}_{j=1}^{l} (i.e., excluding the two basis pixels used to determine the peak values) to conceal the additional data A. To do this, if a scanned pixel q_{j,k} is equal to the value of g_{j,L} or g_{j,R}, a bit x extracted from A is embedded by modifying q_{j,k} to q'_{j,k} according to Eq. (7).
$$q'_{j,k} = \begin{cases} q_{j,k} - x, & q_{j,k} = g_{j,L} \\ q_{j,k} + x, & q_{j,k} = g_{j,R} \end{cases} \qquad (7)$$
Equation (7) shows that if a bit of value 0 is to be embedded, the value of the cover pixel remains unchanged. However, if a bit of value 1 is to be embedded, then depending on whether q_{j,k} matches g_{j,L} or g_{j,R}, the value is decreased or increased by 1. Pixels that do not match g_{j,L} or g_{j,R} are either maintained or shifted by one unit according to Eq. (8).
$$q'_{j,k} = \begin{cases} q_{j,k}, & g_{j,L} < q_{j,k} < g_{j,R} \\ q_{j,k} - 1, & q_{j,k} < g_{j,L} \\ q_{j,k} + 1, & q_{j,k} > g_{j,R} \end{cases} \qquad (8)$$
In Eq. (8), it can be seen that if q_{j,k} lies between the peak values, it remains unchanged; if it is below g_{j,L}, it is shifted by -1, and by +1 if it is above g_{j,R}.
The resulting embedded blocks then make up the final embedded image I'_e = {B'^e_j}_{j=1}^{l}.
Please note that, to make sure the embedded data can be extracted both from the cipher domain and from the plaintext domain, not all of the encrypted blocks are used to carry data. The smoothness of an encrypted block B^e_j = {q̂_{j,L}, q̂_{j,R}, q_{j,k}}_{k=1}^{u·v-2} is evaluated by the difference between its minimum and maximum pixel values. If this difference is not more than a preset threshold T, as shown in Eq. (9), the block is suitable for embedding data and a value of '1' is appended to the location map vector H; otherwise, '0' is appended to H.

$$\max\{\hat{q}_{j,L}, \hat{q}_{j,R}, q_{j,k}\}_{k=1}^{u\cdot v-2} - \min\{\hat{q}_{j,L}, \hat{q}_{j,R}, q_{j,k}\}_{k=1}^{u\cdot v-2} \le T \qquad (9)$$
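As a small illustration of this selection rule (Eq. (9)), assuming the block is given as a flat list of its pixel values:

```python
def block_is_embeddable(block_pixels, T):
    """Eq. (9): a block may carry data only if its max-min spread does not exceed T."""
    return max(block_pixels) - min(block_pixels) <= T
```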
I' = d(I'_e, K) = (I'_e - 1 - K) mod 254    (10)
To extract data from {B''_j}_{j=1}^{l}, we consider the non-basic pixels {q''_{j,k}}_{k=1}^{u·v-2} in each block. Note that we already know the location of the basis pixels from the seed S_d in K_d = {u, v, S_d, H}, so these pixels are left untouched. The embedded data can be extracted from each block B''_j = {q̂_{j,L}, q̂_{j,R}, q''_{j,k}}_{k=1}^{u·v-2} using Eq. (11). Essentially, this means that if the value of a non-basic pixel q''_{j,k} is equal to either peak, then data is embedded there and a '0' is extracted; if the value is equal to either g_{j,L} - 1 or g_{j,R} + 1, then a '1' is extracted.
$$x = \begin{cases} 0, & q''_{j,k} = g_{j,L} \text{ or } q''_{j,k} = g_{j,R} \\ 1, & q''_{j,k} = g_{j,L} - 1 \text{ or } q''_{j,k} = g_{j,R} + 1 \end{cases} \qquad (11)$$
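A minimal per-block sketch of the embedding rules in Eqs. (7)-(8) and the extraction rule in Eq. (11) follows, operating on the list of non-basic pixels of one block with peak values g_L < g_R; block selection by Eq. (9) and the handling of the two basis pixels are omitted, and the helper names are ours.

```python
def embed_block(pixels, gL, gR, bits):
    """Embed bits at the two peaks (Eq. 7) and shift the remaining pixels (Eq. 8).
    `bits` must be an iterator over 0/1 values."""
    out = []
    for q in pixels:
        if q == gL:
            out.append(q - next(bits))      # Eq. (7), left peak
        elif q == gR:
            out.append(q + next(bits))      # Eq. (7), right peak
        elif gL < q < gR:
            out.append(q)                   # Eq. (8), between the peaks: unchanged
        elif q < gL:
            out.append(q - 1)               # Eq. (8), shift away from the left peak
        else:
            out.append(q + 1)               # Eq. (8), shift away from the right peak
    return out

def extract_block(pixels, gL, gR):
    """Recover the bits (Eq. 11) and restore the original non-basic pixels."""
    bits, restored = [], []
    for q in pixels:
        if q in (gL, gR):
            bits.append(0); restored.append(q)
        elif q == gL - 1:
            bits.append(1); restored.append(q + 1)
        elif q == gR + 1:
            bits.append(1); restored.append(q - 1)
        elif gL < q < gR:
            restored.append(q)              # never modified
        elif q < gL - 1:
            restored.append(q + 1)          # undo the left shift of Eq. (8)
        else:
            restored.append(q - 1)          # undo the right shift of Eq. (8)
    return bits, restored

marked = embed_block([50, 60, 60, 70, 80], gL=60, gR=70, bits=iter([1, 0, 1]))
print(marked)                               # [49, 59, 60, 71, 81]
print(extract_block(marked, 60, 70))        # ([1, 0, 1], [50, 60, 60, 70, 80])
```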
To evaluate RDHEI, there are 4 well known and widely used key indicators: payload,
quality of the directly decrypted image, number of incorrectly extracted bits and the
reversibility of the original image (error rate). In this section, we conduct a number of
different experiments to evaluate the performance of the proposed algorithms. We
firstly show the performance of image encryption. Furthermore, the performance of the
proposed RDHEI is analyzed and compared with state-of-the-art alternative approaches
in terms of the payload, image quality and error rate with several commonly used
standard test images.
Fig. 2. Gray-level frequency histograms, showing original images (top row), permutation
encryption by [20] (2nd row), stream cipher approach by [15–19] (3rd row), and RC4 adopted in
our approach (bottom row).
RDHEI proposed by [18] is 8596 bits, about 0.0328 bpp. The payload of the separable
method proposed by [19, 20] is much higher, but neither the image quality of [19] nor
the security of [20] is satisfactory. In order to prove the value of our proposed method,
Fig. 3 shows the PSNR of directly decrypted images generated by Refs. [18, 19] and
the proposed method tested on Lena. All results in Fig. 3 are derived from the best
parameters under a condition that the embedded data can be extracted exactly and the
original image can be recovered error-free. From Fig. 3 we can see that the rate
distortion performance of the proposed scheme is the best.
The final indicator, the reversibility of the original image, is the possibility of
lossless recovery, and its maximum value is 1 (i.e. fully recovered). If a receiver has
both keys, the original image ought to be recovered without error. However, not all
images can be fully recovered in Ref. [19]. Tables 1 and 2 show the error rate of image recovery for [19] and for our proposed method. To obtain the best results, the 8-th bit of the host pixel is used to embed data in Wu's method, and we perform the experiment
Fig. 3. Image quality and payload comparison for Lena, showing the PSNR of directly
decrypted images generated by Refs. [18, 19], and the proposed method.
on each image 100 times, with keys from 1 to 100, to calculate the mean error rate. All experimental results show that the error rate of the proposed method is always 0, which is better than Ref. [19].
4 Conclusion
This paper proposed and evaluated a complete separable framework for reversible data
hiding in encrypted images. The embedded data can be extracted error-free both from
the cipher domain and the plaintext domain. However, the proposed method is not
suitable for images containing saturated pixels. Future work will aim to improve this.
References
1. Hong, W., Chen, T.S.: A novel data embedding method using adaptive pixel pair matching.
IEEE Trans. Inf. Forensics Secur. 7(1), 176–184 (2012). doi:10.1109/tifs.2011.2155062
2. Tian, H., Liu, J., Li, S.: Improving security of quantization-index-modulation steganography
in low bit-rate speech streams. Multimedia Syst. 20(2), 143–154 (2014). doi:10.1007/
s00530-013-0302-8
3. Celik, M.U., Sharma, G., Tekalp, A.M., Saber, E.: Lossless generalized–LSB data embedding.
IEEE Trans. Image Process. 14(2), 253–256 (2005). doi:10.1109/TIP.2004.840686
4. Peng, F., Li, X., Yang, B.: Adaptive reversible data hiding scheme based on integer
transform. Sig. Process. 92(1), 54–62 (2012). doi:10.1016/j.sigpro.2011.06.006
5. Tian, J.: Reversible data embedding using a difference expansion. IEEE Trans. Circ. Syst.
Video Technol. 13(8), 890–896 (2003). doi:10.1109/tcsvt.2003.815962
6. Ni, Z., Shi, Y.Q., Ansari, N., Su, W.: Reversible data hiding. IEEE Trans. Circ. Syst. Video
Technol. 16(3), 354–362 (2006). doi:10.1109/tcsvt.2006.869964
7. Tai, W.L., Yeh, C.M., Chang, C.C.: Reversible data hiding based on histogram modification
of pixel differences. IEEE Trans. Circ. Syst. Video Technol. 19(6), 906–910 (2009). doi:10.
1109/tcsvt.2009.2017409
8. Tsai, P., Hu, Y.C., Yeh, H.L.: Reversible image hiding scheme using predictive coding and
histogram shifting. Sig. Process. 89(6), 1129–1143 (2009). doi:10.1016/j.sigpro.2008.12.
017
9. Zhang, X., Wang, S., Qian, Z., Feng, G.: Reversible fragile watermarking for locating
tempered blocks in JPEG images. Sig. Process. 90(12), 3026–3036 (2010). doi:10.1016/j.
sigpro.2010.04.027
10. Qian, Z., Zhang, X.: Lossless data hiding in JPEG bitstream. J. Syst. Softw. 85(2), 309–313
(2012). doi:10.1016/j.jss.2011.08.015
11. Erkin, Z., Piva, A., Katzenbeisser, S., Lagendijk, R.L., Shokrollahi, J., Neven, G., Barni, M.:
Protection and retrieval of encrypted multimedia content: when cryptography meets signal
processing. EURASIP J. Inf. Secur. 2007, 1–20 (2007). doi:10.1155/2007/78943
12. Schmitz, R., Li, S., Grecos, C., Zhang, X.: Towards robust invariant commutative
watermarking-encryption based on image histograms. Int. J. Multimedia Data Eng. Manage.
5(4), 36–52 (2014). doi:10.4018/ijmdem.2014100103
13. Ma, K., Zhang, W., Zhao, X., Yu, N., Li, F.: Reversible data hiding in encrypted images by
reserving room before encryption. IEEE Trans. Inf. Forensics Secur. 8(3), 553–562 (2013).
doi:10.1109/tifs.2013.2248725
14. Zhang, W., Ma, K., Yu, N.: Reversibility improved data hiding in encrypted images. Sig.
Process. 94, 118–127 (2014). doi:10.1016/j.sigpro.2013.06.023
15. Zhang, X.: Reversible data hiding in encrypted image. IEEE Sig. Process. Lett. 18(4),
255–258 (2011). doi:10.1109/lsp.2011.2114651
16. Hong, W., Chen, T.S., Wu, H.Y.: An improved reversible data hiding in encrypted images
using side match. IEEE Sig. Process. Lett. 19(4), 199–202 (2012). doi:10.1109/lsp.2012.
2187334
17. Liao, X., Shu, C.: Reversible data hiding in encrypted images based on absolute mean
difference of multiple neighboring pixels. J. Vis. Commun. Image Represent. 28, 21–27
(2015). doi:10.1016/j.jvcir.2014.12.007
18. Zhang, X.: Separable reversible data hiding in encrypted image. IEEE Trans. Inf. Forensics
Secur. 7(2), 826–832 (2012). doi:10.1109/tifs.2011.2176120
19. Wu, X., Sun, W.: High-capacity reversible data hiding in encrypted images by prediction
error. Sig. Process. 104, 387–400 (2014). doi:10.1016/j.sigpro.2014.04.032
20. Yin, Z., Luo, B., Hong, W.: Separable and error-free reversible data hiding in encrypted
image with high payload. Sci. World J. 2014, 1–8 (2014). doi:10.1155/2014/604876
21. Ferguson, N., Schneier, B.: Practical Cryptography. Wiley, New York (2003)
Multi-threshold Image Segmentation Through
an Improved Quantum-Behaved Particle
Swarm Optimization Algorithm
1 Introduction
Image segmentation plays a critical role in the process of object recognition in digital
image processing. Nowadays, thousands of image segmentation methods have been proposed [1]. Among them, thresholding is a basic and popular technique: by setting one or more thresholds based on the image histogram, we can assign the pixels within a given threshold interval to the same object [2]. Traditional methods for calculating thresholds include the maximum between-cluster variance (OTSU) method [3], the maximum entropy method [4], the minimum error method [5], etc. Among them, the maximum between-cluster variance method is the most widely used because of its simple calculation and good segmentation performance.
g = ω_1 ω_2 (μ_1 - μ_2)^2    (1)
When g attains its maximum value at the threshold t, t is the optimal threshold. Extending OTSU to multi-threshold segmentation, we obtain the formula in (2).
g = Σ_{k=1}^{M} ω_k (μ_k - μ)^2    (2)
where M is the number of classes, and μ is the average gray level of the image. The variance g attains its maximum at the optimal thresholds T* = {t_1, t_2, …, t_{M-1}}, i.e., T* = Argmax{g}.
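A minimal sketch of this multi-threshold fitness (Eq. (2)) follows, assuming a 256-bin gray-level histogram given as a NumPy array; the convention for the class boundaries and the function name are ours.

```python
import numpy as np

def between_class_variance(hist, thresholds):
    """Eq. (2): sum over the M classes of w_k * (mu_k - mu)^2, where the classes are the
    gray-level intervals delimited by the given thresholds."""
    p = np.asarray(hist, dtype=float)
    p /= p.sum()                                # gray-level probabilities
    levels = np.arange(len(p))
    mu = (p * levels).sum()                     # global mean gray level
    bounds = [0] + sorted(thresholds) + [len(p)]
    g = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        w_k = p[lo:hi].sum()                    # class probability w_k
        if w_k > 0:
            mu_k = (p[lo:hi] * levels[lo:hi]).sum() / w_k   # class mean mu_k
            g += w_k * (mu_k - mu) ** 2
    return g
```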
The calculation of multiple thresholds has a heavy computational burden, so its time performance is low in practice. In essence, the multi-threshold calculation (formula (2)) can be reduced to a multi-objective optimization problem, so multi-objective optimization theory and methods can be applied to this issue. Previous studies show that the quantum-behaved particle swarm optimization algorithm is an effective method for solving problems of this class.
QPSO (quantum-behaved particle swarm optimization) is a swarm intelligence algorithm proposed by Sun et al. in 2004 from the perspective of quantum behavior [6, 7]. It is an improved version of the classical PSO (particle swarm optimization) [8]. Because the uncertainty of the quantum system allows the quantum particles to appear anywhere in the feasible region with a certain probability, the QPSO algorithm has, in theory, better global convergence than the PSO algorithm [9]. In addition, QPSO has fewer control parameters, a simple implementation, and fast convergence. In practice, however, it shows imperfections on some complicated problems: population diversity decreases quickly, the search sometimes gets trapped in local optima, and convergence slows down when the parameters are set poorly. Therefore, in recent years researchers have proposed many approaches to overcome these drawbacks, such as improved settings of the contraction-expansion parameter [10, 11], the introduction of selection and mutation operations [12], and the introduction of multi-population co-evolution strategies [13–15], etc.
Based on an understanding of these prior works, we propose an algorithm named IQPSO (improved quantum-behaved particle swarm optimization) to address these imperfections. Applying IQPSO to multi-threshold image segmentation, our experiments show that IQPSO solves this problem more effectively than QPSO.
O^j_{i,t} = (φ_1 P^j_{i,t} + φ_2 P^j_{g,t}) / (φ_1 + φ_2)    (6)

L^j_{i,t} = 2α |C^j_t - x^j_{i,t}|    (7)
C^j_t = (Σ_{i=1}^{N} P^j_{i,t}) / N    (8)

α = α_max - (α_max - α_min) · t / T    (9)
3 IQPSO
Aiming at the shortcomings of QPSO, we improve the algorithm performance from the
following two aspects.
D^j_{i,t} = |p^j_{b,t} - p^j_{c,t}| / 2    (12)
Combined with the new calculation of the attractive point, the potential well length
L is modified as formula (13).
L^j_{i,t} = 2α D^j_{i,t}    (13)

q = q_max - (q_max - q_min) · t / T    (14)
$$Z^j_{i,t+1} = \begin{cases} x^j_{i,t+1}, & \text{if } rand_j(0,1) < pc \text{ or } j = j_{rand} \\ p^j_{i,t}, & \text{otherwise} \end{cases} \qquad (15)$$
Step 3: after the generation of Zi,t+1, the historical optimal position Pi,t+1 is updated
by formula (16).
$$P_{i,t+1} = \begin{cases} Z_{i,t+1}, & \text{if } f(Z_{i,t+1}) > f(P_{i,t}) \\ P_{i,t}, & \text{otherwise} \end{cases} \qquad (16)$$
randj (0, 1) is a uniformly distributed random number in the interval [0, 1].
jrand is a uniformly distributed random integer in [1, d].
f (·) is the fitness function.
pc is the crossover probability. In this article, pc is directly encoded into each
particle in order to realize adaptive control. After coding, the ith particle is described as
follows.
X_{i,t} = (x^1_{i,t}, x^2_{i,t}, …, x^d_{i,t}, pc_{i,t})    (17)
• Generate the measurement position according to formula (15) and evaluate its fitness value. According to formula (16), update the historical optimal position P_{i,t} and the global optimal position P_{g,t}.
• According to formula (18), update the crossover probability.
• Check the termination condition: if a sufficiently good fitness value has been reached or the maximum number of iterations is exceeded, the iteration stops; otherwise, return to step 4).
When the algorithm terminates, P_{g,t} gives the optimal threshold values.
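The following is a rough sketch of such an IQPSO search for the M-1 thresholds, reusing the between_class_variance helper sketched after Eq. (2). Since the position-update formula itself is not reproduced above, the sketch uses the standard QPSO sampling step x = O ± α·D·ln(1/u) with the modified well length of Eqs. (12)-(13), and a fixed crossover probability in place of the adaptive pc of Eqs. (17)-(18); all parameter values and names are illustrative, not the paper's.

```python
import numpy as np

def iqpso_thresholds(hist, M, n_particles=20, iters=100,
                     alpha_max=1.0, alpha_min=0.5, pc=0.5, seed=0):
    """Illustrative IQPSO-style search for M-1 thresholds maximizing Eq. (2)."""
    rng = np.random.default_rng(seed)
    d = M - 1
    X = rng.uniform(1, 254, (n_particles, d))            # particle positions (thresholds)
    P = X.copy()                                          # personal best positions
    pbest = np.array([between_class_variance(hist, np.sort(x).astype(int)) for x in X])
    g = int(np.argmax(pbest))                             # index of the global best

    for t in range(iters):
        alpha = alpha_max - (alpha_max - alpha_min) * t / iters   # linear decay (cf. Eq. (9))
        for i in range(n_particles):
            phi = rng.uniform(size=d)
            O = phi * P[i] + (1 - phi) * P[g]             # attractor between pbest and gbest
            b, c = rng.choice(n_particles, 2, replace=False)
            D = np.abs(P[b] - P[c]) / 2                   # Eq. (12): well length from two pbests
            u = rng.uniform(1e-12, 1, size=d)
            sign = np.where(rng.uniform(size=d) < 0.5, 1.0, -1.0)
            X[i] = np.clip(O + sign * alpha * D * np.log(1 / u), 1, 254)
            mask = rng.uniform(size=d) < pc               # Eq. (15): crossover with pbest
            mask[rng.integers(d)] = True
            Z = np.where(mask, X[i], P[i])
            f = between_class_variance(hist, np.sort(Z).astype(int))
            if f > pbest[i]:                              # Eq. (16)
                P[i], pbest[i] = Z, f
        g = int(np.argmax(pbest))
    return np.sort(P[g]).astype(int)
```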
M                 2                        3                           4
image      threshold      fitness    threshold       fitness     threshold          fitness
           value          value      value           value       value              value
a          90,180         3054.4956  72,125,191      5608.1314   40,88,132,195      8706.0967
b          109,189        3733.8552  81,122,194      7276.9442   77,109,140,200     11879.7161
c          72,144         1254.0702  66,126,182      2528.4884   63,117,163,222     4512.9173
time (ms)                 1331.0761                  158305.7212                    12073135.52
largest fitness function values, which are equal or closest to the results of the OTSU algorithm. Therefore, we can draw two conclusions: (1) compared with the OTSU algorithm, the IQPSO algorithm greatly shortens the time needed to obtain the threshold values; (2) compared with the other two optimization algorithms, PSO and QPSO, the segmentation effect of IQPSO is better.
Experiment 2: the segmentation effect of the OTSU algorithm and the IQPSO algorithm. When the number of thresholds is 4, the image segmentation results of IQPSO and OTSU are shown in Fig. 2. The segmentation results of the IQPSO algorithm are visually indistinguishable from those of the OTSU method.
From the analysis of the above experimental results, we can draw the following conclusion: the IQPSO algorithm can improve the efficiency of multi-threshold image segmentation while still achieving accurate threshold values.
5 Conclusion
Acknowledgments. This work is supported by the open fund of the Key Laboratory of Computer Network and Information Integration (Southeast University), Ministry of Education (Grant No. K93-9-2015-10C).
References
1. Xiping, L., Jiei, T.: A survey of image segmentation. Pattern Recogn. Artif. Intell. 12(3),
300–312 (1999)
2. Pal, N.R., Pal, S.K.: A review on image segmentation techniques. Pattern Recogn. 26(9),
1277–1294 (1993)
3. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man
Cybern. 9(1), 62–66 (1979)
4. Kapur, J.N., Sahoo, P.K., Wong, A.K.C.: A new method for gray-level picture thresholding
using the entropy of the histogram. Comput. Vision Graph. Image Process. 29(3), 273–285
(1985)
5. Kittler, J., Illingworth, J.: Minimum error thresholding. Pattern Recogn. 19(1), 41–47 (1986)
6. Sun, J., Feng, B., Xu, W.-B.: Particle swarm optimization with particles having quantum
behavior. In: Proceedings of 2004 Congress on Evolutionary Computation, pp. 325–331.
Piscataway, NJ (2004)
7. Sun, J., Xu, W.-B., Feng, B.: A global search strategy of quantum-behaved particle swarm
optimization. In: Proceedings of 2004 IEEE Conference on Cybernetics and Intelligent
Systems, pp. 111–115. Singapore (2004)
8. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proceedings of the 1995 IEEE
International Conference on Neural Networks (1995)
9. Sun, J., Wu, X.J., Palade, V., Fang, W., Lai, C.-H., Xu, W.: Convergence analysis and
improvements of quantum-behaved particle swarm optimization. Inf. Sci. 193, 81–103 (2012)
10. Sun, J., Xu, W.-B., Liu, J.: Parameter selection of quantum-behaved particle swarm
optimization. In: Wang, L., Chen, K., S. Ong, Y. (eds.) ICNC 2005. LNCS, vol. 3612,
pp. 543–552. Springer, Heidelberg (2005)
11. Cheng, W., Chen, S.F.: QPSO with self-adapting adjustment of inertia weight. Comput. Eng.
Appl. 46(9), 46–48 (2010)
12. Gong, S.-F., Gong, X.-Y., Bi, X.-R.: Feature selection method for network intrusion based
on GQPSO attribute reduction. In: International Conference on Multimedia Technology,
pp. 6365–6358 (26–28 July 2011)
13. Gao, H., Xu, W.B., Gao, T.: A cooperative approach to quantum-behaved particle swarm
optimization. In: Proceedings of IEEE International Symposium on Intelligent Signal
Processing, IEEE, Alcala de Henares (2007)
14. Lu, S.F., Sun, C.F.: Co evolutionary quantum-behaved particle swarm optimization with
hybrid cooperative search. In: Proceedings of Pacific-Asia Workshop on Computational
Intelligence and Industrial Application, IEEE, Wuhan (2008)
15. Lu, S.F., Sun, C.F.: Quantum-behaved particle swarm optimization with cooperative-
competitive co evolutionary. In: Proceedings of International Symposium on Knowledge
Acquisition and Modeling, IEEE, Wuhan (2008)
16. Clerk, M., Kennedy, J.: The particle swarm-explosion, stability, and convergence in a
multidimensional complex space. IEEE Trans. Evol. Comput. 6(1), 58–73 (2002)
17. Sun, J., Xu, W.B., Feng, B.: Adaptive parameter control for quantum-behaved particle
swarm optimization on individual level. In: Proceedings of 2005 IEEE International
Conference on Systems, Man and Cybernetics, pp. 3049–3054. Piscataway (2005)
18. Brest, J., Greiner, S., Boskovic, B., et al.: Self adapting control parameters in differential
evolution: a comparative study on numerical benchmark problems. IEEE Trans. Evol.
Comput. 10(6), 646–657 (2006)
19. http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
Coverless Image Steganography
Without Embedding
1 Introduction
In the past two decades, many image steganography methods have been proposed
in the literature [2–12]. To our knowledge, all of these steganography methods des-
ignate an appropriate cover image, and then embed secret information into it to gen-
erate the stego-image. According to reference [1], the existing image steganography
methods can be roughly divided into three categories: spatial domain, frequency
domain and adaptive methods. The spatial domain methods have a larger impact than
the other two categories. Typical spatial domain methods include the LSB replace-
ment [2], LSB matching [3], color palette [4] and histogram-based methods [5].
Although the modification of cover image LSB caused by the embedding mechanism
of the spatial domain methods is not easily detected by human eyes, the embedded
information is sensitive to image attacks. To address this issue, many frequency domain
steganography methods have been proposed, such as quantization table (QT) [6],
discrete Fourier transform (DFT) [7], and discrete wavelet transform (DWT) based
embedding [8]. Adaptive steganography is a special case of the spatial domain and
frequency domain methods. In the literature, there are many typical adaptive
steganography methods, such as the locally adaptive coding-based [9], edge-based [10]
and Bit Plane Complexity Segmentation (BPCS) based data embedding [11, 12].
Although the existing methods employ different technologies for image
steganography, all of them have a common point. That is all of them implement the
steganography by embedding the secret information into a designated cover image.
Since the embedding process will modify the content of the cover image more or less,
modification traces will be left in the cover image. Consequently, it is possible to
successfully detect the steganography by various emerging steganalysis tools such as
[13–16], all of which are based on the modification traces. If we find a novel hiding
manner by which the secret information can be hidden without any modification, it will
be effective to resist all of the existing steganalysis tools. How to hide the secret
information without any modification? It is a challenging and exciting task.
As we know, any original image contains a great deal of information, and with a certain probability this information may already contain the secret data that needs to be hidden. For example, for the secret data {1, 1, 0, 1, 1, 0, 1, 0}, the original image shown on the right of Fig. 1 already contains a pixel whose intensity value is 218; the binary representation of 218 is {1, 1, 0, 1, 1, 0, 1, 0}, which is exactly the secret data. Therefore, if we find appropriate original images that already contain the secret data, use them as stego-images, and communicate the
Fig. 1. The original image which already contains the secret data {1, 1, 0, 1, 1, 0, 1, 0}
secret data by transmitting these images, then we can implement steganography without any modification to these images.
Without employing a designated cover image for embedding the secret data, this paper proposes a novel image steganography framework, called coverless image steganography, which hides the secret data by finding appropriate images that already contain it. These images are regarded as stego-images and are used to communicate the secret data. For a given piece of secret data, the main issue in our framework is how to find these original images. As shown in Fig. 1, directly finding images that have pixels whose intensity values equal the secret data is simple, but it is not robust: during communication of the secret data, these images may suffer image attacks that change the intensity values of their pixels. Thus, instead of the above manner, we find images whose hash sequences, generated by a robust hashing algorithm, are the same as the secret data. In this framework, a series of appropriate original images is chosen from a constructed database by using a robust hashing algorithm. First, a number of images are collected from the networks to construct a database, and these images are indexed according to their hash sequences generated by the robust hashing algorithm. Then, the secret data is transformed into a bit string and divided into a number of segments of equal length. To implement the information hiding, the images whose hash sequences are the same as the segments are chosen from the database as stego-images.
The main contributions of this paper are summarized as follows. Since we do not need a designated cover image for embedding secret data, no modification traces are left, and thus the proposed coverless image steganography framework can resist all of the existing steganalysis tools. Moreover, owing to the robust hashing algorithm, the framework is robust to typical image attacks such as rescaling, luminance change, contrast enhancement, JPEG compression, and noise adding.
The rest of this paper is organized as follows. Section 2 presents the proposed
coverless image steganography framework. Section 3 analyzes the resistance to ste-
ganalysis tools and robustness of the proposed framework. Experiments are presented
in Sect. 4, and conclusions are drawn in Sect. 5.
The flow chart of the proposed coverless steganography framework for secret data
communication is shown in Fig. 2.
In the proposed framework, an image database is first constructed by collecting a
number of images from the networks. Then, for each image in the database, its hash
sequence is generated by a robust hashing algorithm. Afterward, all of these images are
indexed according to their hash sequences to build an inverted index structure. Note
that the hashing algorithm is shared between sender and receiver.
To communicate the secret data, the sender first transforms the secret data into a bit string and divides it into a number of segments of equal length. Then, for each segment, an image whose hash sequence is the same as that segment is found by searching the inverted index structure. Afterward, a series of images associated with the
segments, which can be regarded as stego-images, are obtained and then transmitted to
the receiver. On the receiver end, the hash sequences of these received images are
generated by the same hashing algorithm. Since these hash sequences are the same as
the segments of the secret data, the receiver can concatenate them to recover the
secret data.
According to the above, the main components of this framework for secret data
communication include hash sequence generation by robust hashing algorithm, con-
struction of the inverted index structure by image indexing, finding appropriate images
by searching for the segments in the inverted index structure, and the communication of
the secret data. Each component is detailed as follows.
Fig. 2. The flow chart of the proposed steganography framework for secret data communication
Fig. 3. The procedure of hash sequence generation by the robust hashing algorithm
adjacent I_{i+1} by Eq. (1) to generate the hash sequence of the image {h_1, h_2, …, h_8}. Figure 3 shows the procedure of hash sequence generation by the robust hashing algorithm.

$$h_i = \begin{cases} 1, & \text{if } I_i \ge I_{i+1} \\ 0, & \text{otherwise} \end{cases}, \quad \text{where } 1 \le i \le 8 \qquad (1)$$
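A minimal sketch of this hash follows. The definition of the block averages I_1, …, I_9 does not survive in the text above, so the sketch assumes they are the mean intensities of the nine blocks of a 3×3 partition of the gray-scale image, scanned row by row; this partition and the use of ≥ in the comparison are assumptions.

```python
import numpy as np
from PIL import Image

def robust_hash(path):
    """8-bit robust hash: compare average intensities of adjacent blocks (Eq. 1).
    The 3x3 partition into nine blocks is an assumption about how the I_i are obtained."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    h, w = img.shape
    # Average intensity of each block in a 3x3 grid, scanned row by row: I_1 ... I_9.
    I = [img[r * h // 3:(r + 1) * h // 3, c * w // 3:(c + 1) * w // 3].mean()
         for r in range(3) for c in range(3)]
    # Eq. (1): h_i = 1 if I_i >= I_{i+1}, else 0, for i = 1..8.
    return [1 if I[i] >= I[i + 1] else 0 for i in range(8)]
```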
It is worth noting that there should be at least one image in each list of table T. This ensures that at least one corresponding image can be found in the table for any 8-bit segment, which has 2^8 = 256 possibilities. Thus, the size of the database is no less than 256.
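As a small illustration of the indexing, hiding, and recovery steps described above, the sketch below reuses the robust_hash helper from the previous sketch; it assumes every 8-bit value is covered by at least one image in the database, as required above, and all names are ours.

```python
from collections import defaultdict

def build_index(image_paths):
    """Inverted index: maps each 8-bit hash value (0..255) to the images producing it."""
    index = defaultdict(list)
    for path in image_paths:
        bits = robust_hash(path)
        index[int("".join(map(str, bits)), 2)].append(path)
    return index

def hide(secret_bits, index):
    """Sender side: choose one stego-image for each 8-bit segment of the secret bits."""
    assert len(secret_bits) % 8 == 0
    return [index[int("".join(map(str, secret_bits[i:i + 8])), 2)][0]
            for i in range(0, len(secret_bits), 8)]

def recover(received_paths):
    """Receiver side: hash each received image and concatenate to get the secret bits."""
    return [b for path in received_paths for b in robust_hash(path)]
```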
This section will analyze the reliability of our steganography method in two aspects:
resistance to steganalysis tools and robustness to image attacks.
employing a designated cover image for embedding the secret data, our coverless
steganography method directly finds appropriate original images which already contain
the secret data as stego-images, and thus any modification traces will not be left in the
stego-images.
4 Experiments
In this section, we will test the resistance to the steganalysis tools and the robustness to
the image attacks of our method, and compare with the other steganography methods.
A total of 1000 images downloaded from the networks are used to build our database. To test the performance of our method, we randomly select 200 images from our database and hide 8 bits of secret data in each of them. Two well-known steganography methods, namely LSB replacement and LSB matching, are compared with our method.
(Figure: steganalysis detection results; the vertical axis shows the probability of detection.)
all of those attacks, compared with the other methods. This implies that our method is more robust than the other methods.
5 Conclusion
References
1. Cheddad, A., Condell, J., Curran, K., Mc Kevitt, P.: Digital image steganography: survey
and analysis of current methods. Sig. Process. 90(3), 727–752 (2010)
2. Wu, H.C., Wu, N., Tsai, C.S., Hwang, M.S.: Image steganographic scheme based on
pixel-value differencing and LSB replacement methods. In: Proceedings-Vision Image and
Signal Processing, vol. 152, pp. 611–615 (2005)
3. Mielikainen, J.: LSB matching revisited. IEEE Sig. Process. Lett. 13(5), 285–287 (2006)
4. Johnson, N.F., Jajodia, S.: Exploring steganography: seeing the unseen. IEEE Comput.
31(2), 26–34 (1998)
5. Li, Z., Chen, X., Pan, X., Zheng, X.: Lossless data hiding scheme based on adjacent pixel
difference. In: Proceedings of the International Conference on Computer Engineering and
Technology, pp. 588–592 (2009)
6. Li, X., Wang, J.: A steganographic method based upon JPEG and particle swarm
optimization algorithm. Inf. Sci. 177(15), 3099–3109 (2007)
7. McKeon, R.T.: Strange Fourier steganography in movies. In: Proceedings of the IEEE
International Conference on Electrio/information Technology (EIT), pp. 178–182 (2007)
8. Chen, W.Y.: Color image steganography scheme using set partitioning in hierarchical trees
coding, digital Fourier transform and adaptive phase modulation. Appl. Math. Comput.
185(1), 432–448 (2007)
9. Chang, C.C., Kieu, T.D., Chou, Y.C.: Reversible information hiding for VQ indices based
on locally adaptive coding. J. Vis. Commun. Image Represent. 20(1), 57–64 (2009)
10. Luo, W., Huang, F., Huang, J.: Edge adaptive image steganography based on LSB matching
revisited. IEEE Trans. Inf. Forensics Secur. 5(2), 201–214 (2010)
11. Kawaguchi, E.: BPCS-steganography – principle and applications. In: Khosla, R., Howlett, R.J.,
Jain, L.C. (eds.) KES 2005. LNCS (LNAI), vol. 3684, pp. 289–299. Springer, Heidelberg (2005)
12. Hioki, H.: A data embedding method using BPCS principle with new complexity measures.
In: Proceedings of Pacific Rim Workshop on Digital Steganography, pp. 30–47 (2002)
13. Ker, A.D.: Improved detection of LSB steganography in grayscale images. In: Fridrich,
J. (ed.) IH 2004. LNCS, vol. 3200, pp. 97–115. Springer, Heidelberg (2004)
14. Ker, A.: Steganalysis of LSB matching in grayscale images. IEEE Sig. Process. Lett. 12(6),
441–444 (2005)
15. Xia, Z.H., Wang, X.H., Sun, X.M., Liu, Q.S., Xiong, N.X.: Steganalysis of LSB matching
using differences between nonadjacent pixels. Multimedia Tools Appl. (2014). doi:10.1007/
s11042-014-2381-8
16. Xia, Z.H., Wang, X.H., Sun, X.M., Wang, B.W.: Steganalysis of least significant bit
matching using multi-order differences. Secur. Commun. Netw. 7(8), 1283–1291 (2014)
17. Petitcolas, F.: Watermarking schemes evaluation. IEEE Sig. Process. Mag. 17(5), 58–64
(2000)
Coverless Information Hiding Method
Based on the Chinese Mathematical Expression
Abstract. Recently, many fruitful results have been presented in text information hiding, such as text format-based and text image-based methods. However, existing information hiding approaches have difficulty resisting steganalysis techniques based on statistical analysis. Based on the Chinese mathematical expression, an efficient method of coverless text information hiding is presented, which is a brand-new approach to information hiding. The proposed algorithm first generates a stego-vector directly from the hidden information. Then, based on text big data, a normal text that includes the stego-vector is retrieved, which means that the secret messages can be sent to the receiver without any modification of the stego-text. Therefore, this method is robust against all current steganalysis algorithms and has great theoretical value and practical significance.
1 Introduction
Information hiding is an ancient yet still young and challenging subject. Utilizing the insensitivity of human sensory organs as well as the redundancy of the digital signal itself, the secret information is hidden in a host signal without affecting the perceptual quality or the practical value of the host signal. The host here covers all kinds of digital carriers such as image, text, video, and audio [1]. Since text is the most frequently used and most widespread information carrier, research on text information hiding has attracted many scholars' interest and has produced many results.
There are four main types of text information hiding technologies: text format-based, text image-based, generation-based, and embedding-based natural language information hiding.
Text format-based information hiding mainly achieves the hiding of secret information by changing the character spacing, inserting invisible characters (spaces, tabs, special spaces, etc.), and modifying the format of documents (PDF, HTML, Office). For example, [2, 3] hid data by changing characteristics such as row spacing, word spacing, character height, and character width. The methods in [4–8] embedded data by using programming interfaces to modify certain properties of Office documents (including the NoProofing attribute, character color attributes, font size, font type, font underline, the Range object, the object's Kerning property, color properties, etc.). Based on the format information of disk volumes, Blass et al. proposed a robust hidden volume encryption in [9, 10]. The hiding capacity of format-based methods is large, but most of them cannot resist re-composing and OCR attacks, nor can they resist steganography detection based on statistical analysis [11, 12].
The main idea of text image-based information hiding is to regard a text as a kind of binary image and then combine the features of binary images with those of texts to hide data. For example, [13] embedded information by utilizing the parity of the numbers of black and white pixels in a block; in [14, 15], information was embedded by modifying the proportion of black and white pixels in a block and the pixel values of the outer edge, respectively. In [16], the embedding of secret information is realized by rotating the strokes of Chinese characters. In addition, Daraee et al. [17] presented an information hiding method based on hierarchical coding, and Satir et al. [18] designed a text information hiding algorithm based on compression, which improved the embedding capacity. The biggest problem of text image-based information hiding is that it cannot resist re-composing and OCR attacks: after the characters carrying the hidden information are re-composed into an unformatted text, the hidden information completely disappears.
Generation-based natural language information hiding utilizes natural language processing (NLP) technologies to carry secret information by generating text that resembles natural content. It can be divided into two types: the primary text generation mechanism and the advanced text generation mechanism. The former is based on probability statistics; it encodes information by utilizing a random dictionary or the occurrence frequencies of letter combinations, so that the generated text meets the statistical characteristics of natural language. The latter is based on the linguistic approach: following linguistic rules, it carries the secret data in imitation natural-language text without specific content [19]. In these methods, owing to the lack of artificial intelligence for the automatic generation of arbitrary text, the generated text often contains idiomatic mistakes, common-sense errors, or sentences without complete meaning. Moreover, it may cause incoherent semantic context and poor readability, which is easily recognized by human readers [20, 21].
Embedding-based natural language information hiding embeds the secret information by modifying the text at different granularities [22, 23]. According to the scope of the modified text, it can be divided into lexical-level and sentence-level information hiding. The former hides the messages by means of substitution of similar characters [24], substitution of spelling mistakes [25], substitution of abbreviations/acronyms and words in their complete form [26], etc. Based on the understanding of the sentence structure and its semantics, the latter changes the sentence structure to hide information, utilizing syntactic transformation and restatement techniques that preserve the meaning and style [27–30]. The embedding method is the focus of current research on text information hiding. However, it needs the support of natural language processing technologies, such as syntactic parsing, disambiguation, and automatic generation, so that the information embedded into the text respects word rationality, collocation accuracy, syntactic structure, and the statistical characteristics of the language [31]. Because of the limitations of existing NLP technology, such hiding algorithms are hard to realize. In addition, there is still some deviation and distortion in the statistical and linguistic properties [32].
From the above we can see that text information hiding has produced many research results, but some problems remain, such as weak resistance to statistical analysis and poor text rationality. Furthermore, in theory, as long as the carrier is modified, the secret message can be detected; as long as the secret information exists in a cover, it can hardly escape steganalysis. Thus, the existing steganography technology faces a serious security challenge, and its development has encountered a bottleneck.
The proposed method first performs syntactic parsing on the information to be hidden and divides it into independent keywords, and then uses the Chinese mathematical expression [33] to create locating tags. After that, utilizing cloud search services for multi-keyword ranked search [34, 35], a normal text containing the secret information can be retrieved, which achieves the direct transmission of the secret information. The method requires no other carriers and no modifications, while it can resist all kinds of existing steganalysis methods. This research has an important positive significance for the development of information hiding technology.
2 Related Works
The Chinese character mathematical expression was proposed by Sun et al. in 2002 [33]. The basic idea is to express a Chinese character as a mathematical expression, in which the operands are components of Chinese characters and the operators are six spatial relations between components. Some definitions are given below.
Definition 1. A basic component is composed of several strokes; it may be a Chinese character or a part of a Chinese character.
Definition 2. An operator is the location relation between components. Let A, B be two components; A lr B, A ud B, A ld B, A lu B, A ru B, and A we B represent that A and B have the spatial relation of left-right, up-down, left-down, left-upper, right-upper, and whole-enclosed, respectively. An intuitive explanation of the six operators is shown in Fig. 1.
Definition 3. The priority of the six operators defined in Definition 2 is as follows: (1) () is the highest; (2) we, lu, ld, ru are in the middle; (3) lr, ud are the lowest; the operating direction is from left to right.
Using the selected 600 basic components and the six operators in Fig. 1, we can express all 20,902 CJK Chinese characters in Unicode 3.0 by such mathematical expressions. The representation is very natural and has a simple structure, and every character can be processed by certain operational rules, just like a general mathematical expression. After Chinese characters are expressed in mathematical symbols, much of the processing of Chinese information becomes simpler than before.
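As a rough illustration of how a character can be handled as a mathematical expression, the Python sketch below models an expression as a small tree of basic components joined by the six operators of Definition 2. The component names and the sample decomposition are hypothetical; the actual 600-component set and character decompositions come from [33].

```python
from dataclasses import dataclass
from typing import Union

# The six spatial operators of Definition 2 (left-right, up-down, left-down,
# left-upper, right-upper, whole-enclosed).
OPERATORS = {"lr", "ud", "ld", "lu", "ru", "we"}

@dataclass
class Expr:
    """A Chinese character expressed as <left> <operator> <right> over components."""
    op: str                       # one of the six spatial operators
    left: Union["Expr", str]      # a basic component or a sub-expression
    right: Union["Expr", str]

    def components(self):
        """Flatten the expression into its basic components (illustrative only)."""
        out = []
        for side in (self.left, self.right):
            out.extend(side.components() if isinstance(side, Expr) else [side])
        return out

# Hypothetical expression: a character whose left part sits beside an
# upper-lower compound, i.e.  A lr (B ud C)  with A, B, C basic components.
char = Expr("lr", "A", Expr("ud", "B", "C"))
print(char.components())  # ['A', 'B', 'C']
```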
According to the Chinese mathematical expression, we can see that selecting appropriate components as the location labels of the secret message is better than directly selecting a word or phrase as the index, in terms of indicators such as randomness, distinguishability, and universality.
3 Proposed Method
Unlike conventional information hiding, which needs to find a carrier in which to embed the secret information, coverless information hiding requires no other carriers. Driven by the secret information, an encryption vector is generated, and then a normal text containing the encrypted vector is retrieved from the text big data, so the secret message can be conveyed directly without any modification.
From the above analysis, the coverless information hiding algorithm has three characteristics. The first is "no embedding", that is, secret information is not embedded into a carrier by modifying it. The second is "no additional message needs to be transmitted except the original text", that is, other than the original agreement, no other carriers are used to send auxiliary information such as the details or parameters of embedding or extraction. The third is "anti-detection", i.e., the method can resist all existing detection algorithms. Based on these characteristics together with the related theory of the Chinese mathematical expression, this paper presents a text-based coverless information hiding algorithm.
Based on the locating tags, we generate the keywords that contain the converted secret data and the locating tags. Furthermore, we search the texts that contain the keywords in the database so as to achieve information hiding with zero modification.
Let m be a Chinese character, and T be the set of the 20,902 CJK Chinese characters in Unicode 3.0. Suppose the secret message is $M = m_1 m_2 \ldots m_n$; the conversion and location process of the secret information can be summarized as in Fig. 2. The details are introduced as follows:
Here $F_c(p_i, k)$ is the transformation function used for the statistics and analysis of the text database; this function is open to all users, $k$ is the private key of the information receiver, and $s$ is the number of keywords in the text database. The difference between the quantities of $p_i$ and $p_i'$ should not be too large; otherwise a commonly used word would be converted into a rarely used word, which would greatly decrease the retrieval efficiency of the stego-text. Therefore, using $F_c$, the converted keyword $w_i'$ of each $w_i$ can be calculated, and we obtain the converted secret message $W' = w_1' w_2' \ldots w_N'$.
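The paper does not give a concrete form for the word-conversion function $F_c(p_i, k)$, only the requirements that it be public, keyed by the receiver's private key $k$, and roughly frequency-preserving. The following Python sketch is therefore only one possible reading: it shifts a keyword by $k$ positions within the database frequency ranking so that the converted word remains about as common as the original. The function names, the sample counts, and the offset rule are all assumptions.

```python
def build_frequency_ranking(keyword_counts):
    """Sort the s database keywords by frequency (most frequent first)."""
    return [w for w, _ in sorted(keyword_counts.items(), key=lambda x: -x[1])]

def convert_keyword(keyword, ranking, k):
    """One possible form of F_c(p_i, k): replace a keyword by the one whose
    frequency rank is offset by the private key k (an assumption, not the
    paper's exact formula)."""
    s = len(ranking)
    idx = ranking.index(keyword)
    return ranking[(idx + k) % s]

def invert_keyword(converted, ranking, k):
    """Inverse transform F_c^{-1}(p_i, k), used by the receiver during extraction."""
    s = len(ranking)
    idx = ranking.index(converted)
    return ranking[(idx - k) % s]

# toy usage with hypothetical keywords and counts
counts = {"cloud": 120, "data": 110, "security": 90, "model": 60, "hash": 30}
ranking = build_frequency_ranking(counts)
w = convert_keyword("data", ranking, k=2)
print(w, invert_keyword(w, ranking, k=2))  # -> "model data"
```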
(3) Get the locating tags. For the retrieved text database, first divide the Chinese characters into components by using the Chinese mathematical expression, then calculate the frequency of every component, and finally sort the components in descending order of frequency. Select the components whose occurrence counts are in the top 50 and then determine the locating tags.
(1) Extraction preprocessing. Because the text database is open to all users, the 50 Chinese components used for marking and the corresponding Chinese characters are also available as public information, denoted $b = \{b_i \mid i = 1, \ldots, 50\}$ and $m_i = \{m_{ij} \mid j = 1, \ldots\}$, respectively. Therefore, utilizing the user's private key, we can obtain the locating components and their order of appearance. Moreover, based on the statistical results, it is also easy to obtain the Chinese characters $L = \{L_i \mid i = 1, 2, \ldots, N\}$ in the database.
(2) The extraction of candidate keywords. Sequentially scan the stego-text S, then extract the candidate locating tags $CL = \{CL_i \mid i = 1, 2, \ldots, N'\}$ and the candidate keywords $S' = \{S'_i \mid i = 1, 2, \ldots, N'\}$ according to the user's locating component set, where $N' \geq N$.
When $N' = N$, CL is the set of locating tags of the secret information; skip step (3) and go to step (4). When $N' > N$, there exist non-locating components contained in CL, and they should be eliminated from CL in step (3).
(3) Eliminate the redundant tags. The procedure is introduced as follows:
(a) Compare $CL_i$ with $L_i$ starting from $i = 1$; if $CL_i = L_i$, update $i = i + 1$ and execute step (a) again; otherwise skip to step (b);
(b) If $CL_{i-1} \neq CL_i$ but the two share the same components, then $CL_i$ is not a locating tag; delete it and skip to step (a). If $CL_{i-1} \neq CL_i$ and they have different components, then compare the quantity ranking of the two Chinese characters in the text database; the Chinese character that does not match the receiver's keys is not a locating tag and is deleted from CL; otherwise skip to step (c);
(c) If $CL_{i-1} = CL_i$, then at least one of $CL_{i-1}$ and $CL_i$ is not a locating tag, so combine $CL_{i-1}$ and $CL_i$ with their subsequent keywords to generate the retrieved keyword $D_i$, delete the one with the smaller count, and skip to step (a).
When the correct tags are determined, locate them in S and then extract the character strings $S_i\ (i = 1, 2, \ldots, N)$ after the locating points, where $|S_i| = \ell$;
(4) Since each keyword is divided by dependency syntax before the hiding, the length $\ell_{w_i}$ of every keyword may not be exactly the same, where $1 \leq \ell_{w_i} \leq \ell$. Moreover, because of the word conversion, the keywords cannot be extracted exactly. Therefore, when using the inverse transform of the word conversion $F_c^{-1}(p_i, k)$ to restore the string $S_i$, the obtained candidate keyword set $K_i = \{K_i^j \mid 1 \leq j \leq \ell\}$ is not unique;
(5) Select a keyword from every $K_i\ (i = 1, 2, \ldots, N)$ and generate the candidate secret messages by analyzing the language features and the word segmentation based on the user's background; then measure the confidence of each candidate secret message by analyzing the edit distance and similarity of the keywords, and recommend a ranking to the receiver (see the sketch after this list);
(6) Utilizing the ranked recommendations and combining language analysis with Chinese grammar features, the receiver can recover the secret information $M = m_1 m_2 \ldots m_n$.
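As a rough illustration of step (5), the sketch below enumerates one keyword per candidate set $K_i$ and ranks the resulting candidate messages by a simple similarity score. The use of `difflib.SequenceMatcher` as the edit-distance stand-in, the reference vocabulary, and the romanized toy keywords are assumptions, not the paper's exact confidence measure.

```python
from difflib import SequenceMatcher
from itertools import product

def edit_similarity(a, b):
    """Similarity in [0, 1] between two keyword strings (a stand-in for the
    paper's edit-distance/similarity confidence measure)."""
    return SequenceMatcher(None, a, b).ratio()

def rank_candidates(candidate_sets, reference_keywords):
    """Enumerate one keyword per candidate set K_i, score each combination by its
    average similarity to reference keywords (e.g. user-background vocabulary),
    and return the combinations sorted by confidence for the receiver."""
    scored = []
    for combo in product(*candidate_sets):
        score = sum(max(edit_similarity(w, r) for r in reference_keywords)
                    for w in combo) / len(combo)
        scored.append((score, combo))
    return sorted(scored, reverse=True)

# toy usage with hypothetical romanized keywords
candidate_sets = [["yunjisuan", "yunji"], ["anquan", "anqu"]]
reference = ["yunjisuan", "anquan", "shuju"]
for score, combo in rank_candidates(candidate_sets, reference)[:2]:
    print(round(score, 2), combo)
```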
4 Example Verification
Finally, retrieve the text database to find a stego-text which contains the
locating tags and keywords, where using the rules of the above,
is a stego-text with the retrieved secret
information .
In the case of the recipient’s encrypted text information
, because of the absence of redundant
components, the extraction process is the inverse process of the embedding process.
The realization is relatively simple, so we do not repeat it here.
5 Conclusions
This paper presented a text information hiding method based on the Chinese mathematical expression. Unlike conventional information hiding methods, which need to find an embedding carrier for the secret message, the proposed method requires no other carriers. First, an encryption vector is generated from the secret information, and then a normal text containing the encrypted vector is retrieved from the text database, which realizes hiding directly, without any modification of the stego-text. Therefore, the proposed method can resist all kinds of existing steganalysis methods. This research has an important positive significance for the development of information hiding technology.
Acknowledgments. This work is supported by the National Natural Science Foundation of China
(NSFC) (61232016, U1405254, 61502242, 61173141, 61173142, 61173136, 61373133), Jiangsu
Basic Research Programs-Natural Science Foundation (SBK2015041480), Startup Foundation for
Introducing Talent of Nanjing University of Information Science and Technology (S8113084001),
Open Fund of Demonstration Base of Internet Application Innovative Open Platform of Department
of Education (KJRP1402), Priority Academic Program Development of Jiangsu Higher Education
Institutions (PAPD) Fund, Collaborative Innovation Center of Atmospheric Environment and
Equipment Technology (CICAEET) Fund, National Ministry of Science and Technology Special
Project Research GYHY201301030, 2013DFG12860, BC2013012.
References
1. Cox, I.J., Miller, M.L.: The first 50 years of electronic watermarking. J. Appl. Signal
Process. 2, 126–132 (2002)
2. Low, S.H., Maxemchuk, N.F., Lapone, A.M.: Document identification for copyright
protection using centroid detection. IEEE Trans. Commun. 46(3), 372–383 (1998)
3. Brassil, J.T., Low, S.H., Maxemchuk, N.F.: Copyright protection for the electronic
distribution of text documents. Proc. IEEE 87(7), 1181–1196 (1999)
4. Ffencode for DOS (2015). http://www.burks.de/stegano/ffencode.html
5. WbStego4.2 (2015). http://home.tele2.at/wbailer/wbstego/
6. Kwan, M.: Snow (2015). http://www.darkside.com.au/snow/index.html
7. Koluguri, A., Gouse, S., Reddy, P.B.: Text steganography methods and its tools. Int. J. Adv.
Sci. Tech. Res. 2(4), 888–902 (2014)
8. Qi, X., Qi, J.: A desynchronization resilient watermarking scheme. In: Shi, Y.Q. (ed.)
Transactions on Data Hiding and Multimedia Security IV. LNCS, vol. 5510, pp. 29–48.
Springer, Heidelberg (2009)
9. Blass, E.O., Mayberry, T., Noubir, G., Onarlioglu, K.: Toward robust hidden volumes using
write-only oblivious RAM. In: Proceedings of the 2014 ACM Conference on Computer and
Communications Security (CCS 2014), pp. 203–214 (2014)
10. Mayberry, T., Blass, E.O., Chan, A.H.: Efficient Private file retrieval by combining ORAM
and PIR. In: Proceedings of 20th Annual Network & Distributed System Security
Symposium (NDSS 2014), pp. 1–11 (2014)
11. Goyal, L., Raman, M., Diwan, P.: A robust method for integrity protection of digital data in
text document watermarking. Int. J. Sci. Res. Dev. 1(6), 14–18 (2014)
12. Kwon, H., Kim, Y., Lee, S.: A tool for the detection of hidden data in microsoft compound
document file format. In: 2008 International Conference on Information Science and
Security, pp. 141–146 (2008)
13. Wu, M., Liu, B.: Data hiding in binary images for authentication and annotation. IEEE
Trans. Multimedia 6(4), 528–538 (2004)
14. Zhao, J., Koch, E.: Embedding robust labels into images for copyright protection. In:
Proceedings of the International Congress on Intellectual Property Rights for Specialized
Information, Knowledge and New Technologies, Australia, pp. 242–251 (1995)
15. Xia, Z.H., Wang, S.H., Sun, X.M., Wang, J.: Print-scan resilient watermarking for the
Chinese text image. Int. J. Grid Distrib. Comput. 6(6), 51–62 (2013)
16. Tan, L.N., Sun, X.M., Sun, G.: Print-scan resilient text image watermarking based on stroke
direction modulation for Chinese document authentication. Radioengineering 21(1), 170–
181 (2012)
17. Daraee, F., Mozaffari, S.: Watermarking in binary document images using fractal codes.
Pattern Recogn. Lett. 35, 120–129 (2014)
18. Satir, E., Isik, H.: A compression-based text steganography method. J. Syst. Softw. 85(10),
2385–2394 (2012)
19. Wayner, P.: Disappearing Cryptography: Information Hiding: Steganography & Water-
marking, 2nd edn. Morgan Kaufmann, San Francisco (2009)
20. Taskiran, C.M., Topkara, U., Topkara, M.: Attacks on lexical natural language
steganography systems. In: Proceedings of the SPIE, Security, Steganography and
Watermarking of Multimedia Contents VIII, San Jose, USA, pp. 97–105 (2006)
21. Meng, P., Huang, L.S., Yang, W.: Attacks on translation based steganography. In: 2009
IEEE Youth Conference on Information, Computing and Telecommunication, Beijing,
China, pp. 227–230 (2009)
Coverless Information Hiding Method 143
22. Nematollahi, M.A., Al-Haddad, S.A.R.: An overview of digital speech watermarking. Int.
J. Speech Technol. 16(4), 471–488 (2013)
23. Mali, M.L., Patil, N.N., Patil, J.B.: Implementation of text watermarking technique using
natural language. In: IEEE International Conference on Communication Systems and
Network Technologies, pp. 482–486 (2013)
24. Xiangrong, X., Xingming, S.: Design and implementation of content-based English text
watermarking algorithm. Comput. Eng. 31(22), 29–31 (2005)
25. Topkara, M., Topkara, U., Atallah, M.J.: Information hiding through errors: a confusing
approach. In: Proceedings of the SPIE International Conference on Security, Steganography,
and Watermarking of Multimedia Contents, San Jose, 6505 V (2007)
26. Rafat, K.F.: Enhanced text steganography in SMS. In: 2009 2nd International Conference on
Computer, Control and Communication, Karachi, pp. 1–6 (2009)
27. Meral, H.M., Sankur, B., Ozsoy, A.S.: Natural language watermarking via morphosyntactic
alterations. Comput. Speech Lang. 23(1), 107–125 (2009)
28. Liu, Y., Sun, X., Wu, Y.: A natural language watermarking based on chinese syntax. In:
Wang, L., Chen, K., Ong, Y.S. (eds.) ICNC 2005. LNCS, vol. 3612, pp. 958–961. Springer,
Heidelberg (2005)
29. Kim, M.Y., Zaiane, O.R., Goebel, R.: Natural language watermarking based on syntactic
displacement and morphological division. In: 2010 IEEE 34th Annual Computer Software
and Applications Conference Workshops, Seoul, Korea, pp. 164–169 (2010)
30. Dai, Z.X., Hong, F.: Watermarking text documents based on entropy of part of speech string.
J. Inf. Comput. Sci. 4(1), 21–25 (2007)
31. Gang, L., Xingming, S., Lingyun, X., Yuling, L., Can, G.: Steganalysis on synonym
substitution steganography. J. Comput. Res. Dev. 45(10), 1696–1703 (2008)
32. Peng, M., Liu-sheng, H., Zhi-li, C., Wei, Y., Ming, Y.: Analysis and detection of translation
based steganography. ACTA Electronica Sinica 38(8), 1748–1752 (2010)
33. Sun, X.M., Chen, H.W., Yang, L.H., Tang, Y.Y.: Mathematical representation of a chinese
character and its applications. Int. J. Pattern Recogn. Artif. Intell. 16(8), 735–747 (2002)
34. Xia, Z., Wang, X., Sun, X., Wang, Q.: A secure and dynamic multi-keyword ranked search
scheme over encrypted cloud data. IEEE Trans. Parallel Distrib. Syst. 99 (2015). doi: 10.1109/
TPDS.2015.2401003
35. Fu, Z., Sun, X., Liu, Q., Zhou, L., Shu, J.: Achieving efficient cloud search services:
multi-keyword ranked search over encrypted cloud data supporting parallel computing.
IEICE Trans. Commun. 98, 190–200 (2015)
System Security
Network Information Security Challenges
and Relevant Strategic Thinking
as Highlighted by “PRISM”
Jing Li
Abstract. The emergence of cloud computing, big data, and other technologies has ushered us into a new age of information-based lifestyles and is profoundly changing the global economic ecology. It follows that network security risk has
also become one of the most severe economic and national security challenges
confronted by various countries in the 21st Century. The recent “PRISM” event
has exposed the weakness of the Chinese network information security. Given
the huge risks, it is high time that China reconsiders network information
security strategy so as to ensure citizen rights, network information security and
national security while making the most of new technologies to improve eco-
nomic efficiency. As such, this paper attempts to analyze the challenges con-
fronted by China in terms of network information security in the context of the
“PRISM” event, with a view to constructing the Chinese network information
security strategy from the perspective of law.
1 Introduction
The emergence of cloud computing, big data, and other technologies has ushered us into a new age of information-based lifestyles and is profoundly changing the global economic
ecology. It follows that network security risk has also become one of the most severe
economic and national security challenges confronted by various countries in the 21st
Century. The recent “PRISM” event, which is a secret surveillance program under
which the United States National Security Agency (NSA) collects internet communi-
cations from at least nine major US internet companies, has exposed the weakness of
the Chinese network information security. Given the huge risks, it is high time that
China reconsiders network information security strategy so as to ensure citizen rights,
network information security and national security while making the most of new
technologies to improve economic efficiency. As such, this paper attempts to analyze
the challenges confronted by China in terms of network information security in the
context of the “PRISM” event, with a view to constructing the Chinese network
information security strategy from the perspective of law.
It is lawful in the United States for cloud computing service providers to disclose user information to the American government. However, this severely damages the information privacy and security of individual users in another country and may even affect the national security of another country.
Secondly, the "separation of data owner and controller" in cloud computing makes it even harder to realize physical regulation of network information and to assign accountability. In the traditional mode, data stored in local technical facilities are controllable both logically and physically, and hence the risk is also controllable. In the cloud computing mode, user data are not stored in local computers but in a centralized manner in remote servers outside the firewall. As a result, the traditional information security safeguards relying on machines or physical network boundaries no longer function. When an information security event occurs, the log records may be scattered across multiple hosts and data centers located in different countries. Therefore, even applications and hosting services deployed by the same cloud service provider may make record tracing difficult, which no doubt makes it harder to collect evidence and keep data confidential [4].
Big data technology shakes the basic principles of data protection. Big data refers to a more powerful data mining approach applied by various organizations, relying on mass data, faster computers, and new analytical skills for the purpose of mining concealed and valuable relevant data [5]. All internet giants involved in "PRISM", including Google, Microsoft, Facebook, and Yahoo, utilize big data technology in different forms and make data their main asset and source of value.
Firstly, big data may shake the cornerstone of data protection regulations. The EU data protection directive, the draft EU general data protection regulation, and the data protection laws of most other countries rely on the requirements of "transparency" and "consent" to ensure that users share personal information on an informed basis [6]. However, the nature of big data is to seek unexpected connections and create results that are difficult to predict through data mining and analysis [7]. Users are not sufficiently aware of the object and purpose they have consented to. What is more, even the company utilizing data mining technology cannot predict what will be discovered through big data techniques. As such, it is very difficult to realize "consent" in any substantial sense.
Secondly, the huge commercial profit brought by big data technology may induce service providers to disregard users' privacy and thereby threaten data security. In the context of the internet, on the one hand, multiple network information intermediaries have access to users' big data, and owing to big data technology, such intermediaries can easily obtain and process this information at extremely low cost; on the other hand, big data can generate astounding commercial profit: according to a McKinsey & Company report, big data contributes USD 300 billion to the health industry in the United States and EUR 250 billion to the public administration sector in Europe on an annual basis [8]. As a result, network intermediaries with access to customers' massive data are strongly motivated to exploit that data in ways unimaginable to, and often undetectable by, users. More and more commercial organizations are beginning to see potential commercial opportunities in reselling collected data and to make a profit in this way. Large-scale financial institutions have started to market data
relating to their customers' payment cards (e.g. frequented stores and purchased commodities); in the Netherlands, a GPS positioning service provider sold its customers' mobile geocoding data to governmental agencies including the police, although such data had originally been collected to plan the optimal installation of speed-trap radars. In the face of huge profit temptations, with nobody knowing or supervising, information intermediaries are likely to utilize information in a manner deviating from its original purpose, thereby exposing users' information to great risks.
The "PRISM" event sounded the alarm for network information security in China. It is both necessary and urgent to actively develop Chinese network information security protection strategies from the perspective of law. With respect to such strategies, a precondition is to define the obligations and liabilities of network service providers, with improving network users' right to remedies at the core. In the meantime, it is also important to safeguard legal rights on a global basis and to steer international negotiations in China's favor. The reflections in this paper are undertaken from both the domestic and the international perspective.
extensive network users. At the same time, in the context of emerging technologies such as cloud computing and big data, the features of network information security risks, i.e., their cross-regional nature, invisibility, potential for large-scale damage, and uncertainty, have objectively raised higher requirements for Chinese network information security legislation. As such, it is advisable to define the obligations and liabilities of network service providers from the following aspects:
The burden of proof should be allocated in favor of users. In the network service relationship, once a user selects a certain service provider, that provider gains control over the user's data and may access the user's trade secrets and individual private information by inspecting the user's records. The user is thus placed in a relatively disadvantaged position in terms of acquiring, and negotiating over, technology and information. To this end, in order to prevent network service providers from misusing their advantageous positions in the network service relationship, and taking into account the difficulty individual users face in obtaining evidence from the network, it should be stipulated that, with respect to user losses resulting from network secret disclosure, network service providers should mainly assume the burden of proof while also performing the relevant data security protection obligations.
A one-stop competent authority should be established. Network information security involves multiple links of social administration, including individual privacy, commercial order, telecommunication facility security, financial security, public security, and national security. The competent authority of each link holds part of the network supervision functions, but this fragmented administration system, in which each authority acts in its own way, has increased the uncertainty and cost of corporate compliance and made it more difficult for citizens to safeguard their legal rights. It is thus advisable to set up a network information security authority covering all industries. Considering the strategic significance of network security, such an authority should be directly under the State Council, and its responsibilities should include the following: developing network information security policies and strategies; supervising whole-industry network security compliance with corresponding administrative enforcement power; serving as a reception portal processing user complaints concerning network information security; and facilitating the settlement of user disputes regarding network security in a low-cost, high-efficiency, one-stop manner.
Multi-level remedy procedures should be provided. Besides traditional judicial procedures, it is advisable to introduce mediation and arbitration procedures tailored to the features of network data protection disputes, i.e., extensive involvement, strong technicality, and time sensitivity, and to set up an arbitration commission under the network competent authority dedicated to processing personal information disputes. Consisting of supervisory government officials, network security technicians, consumer representatives, internet service provider representatives, scholars, etc., the arbitration commission should propose a mediation plan within a given time limit for the reference of the various parties. If the plan is accepted by the parties, it should be implemented as a binding document; if not, judicial procedures should be started. On the other hand, as data disclosure involves a great number of people, and the internet industry, as an emerging industry, would suffer huge costs if time-consuming individual actions were resorted to, it is advisable that a procedure for data protection class action/arbitration be stipulated, that the effect of and procedural linkage between action and arbitration be defined, and that guidance be provided so that disputes can be settled quickly and conveniently through class action or arbitration.
perfected, but also efforts should be made to influence the direction of future negotiations and international cooperation on international network security standards.
Today, network security has become a common challenge confronted by the
international community. It is no longer possible to realize prevention against network
security risks in the closed domestic law system exclusively, but rather cooperation and
settlement should be sought after in the international law framework. Under many
circumstances, the network security issue is no longer a country’s internal judicial issue
but rather a judicial issue involving multiple countries. Particularly, information
leakage events with extensive influence and involving parties in multiple countries such
as the “PRISM” will inevitably involve practical difficulties in terms of cross-border
evidence collection, coordination of dispute jurisdiction and definition of law appli-
cation [13]. Currently, many international organizations, including the United Nations
(UN), Organization for Economic Co-operation and Development (OECD), Asia
Pacific Economic Cooperation (APEC), European Union (EU), North Atlantic Treaty
Organization (NATO), Group of Eight, International Telecommunication Union
(ITU) and International Organization for Standardization (ISO), are addressing the
issue of information and communication infrastructure. Some newly established
organizations have started to consider the development of network security policies and
standards, and existing organizations also actively seek to expand their functions to this
field [14]. Such organizations’ agreements, standards or practices will generate global
influence. Meanwhile, following the “PRISM” event, various countries start to adjust
their existing network security strategies successively. It is foreseeable that the coming
several years will be a crucial period propelling the establishment of the network
security international cooperation framework. As the leading network security victim
and the largest developing country, on the one hand, China should conduct relevant
research to systematically learn about the features and tendencies of various countries’
network security strategies, particularly evidence of certain countries infringing upon
another country's citizens' network information security, so as to provide backup to China
in safeguarding legal rights in the international battlefield; on the other hand, it is
necessary that China takes the initiative to participate in the development of network
security international conventions and standards so as to fully express China’s rea-
sonable appeal for interest and actively strive for China’s benefit in the network space
through close cooperation with various countries and by virtue of various international
platforms.
4 Conclusions
Network information technology is a double-edged sword that can either vigorously facilitate a country's economic development or be used to monopolize global information. What the "PRISM" event discloses is only the tip of the iceberg of the network information security risks confronting China. As such, it is highly urgent to thoroughly explore the roots of network information security risks and to develop national network security strategies conforming to China's industrial development and security interests. The current idea of the Chinese network information security
strategies is as follows: for one thing, the domestic legislation and remedy system should be developed on the precondition of defining the obligations and liabilities of network service providers, with improving network users' right to remedies at the core, so as to correct market failure and maximize the efficacy of policies in ensuring network security and promoting technological development and application; for another, legal rights should be safeguarded on a global basis and international negotiations should be steered in China's favor so as to protect China's interests in the network space. China can overcome the current obstacles and guard against potential risks only by focusing closely on the key links and crux issues handicapping Chinese network information security.
References
1. Boyd, D., Crawford, K.: Six Provocations for Big Data. http://softwarestudies.com/cultural_
analytics/Six_Provocations_for_Big_Data.pdf
2. Simmons, J.L.: Buying you: the Government’s use of fourth-parties to launder data about
‘The People’. Columbia Bus. Law Rev. 3, 956 (2009)
3. Froomkin, A.M.: The internet as a source of regulatory arbitrage. In: Kahin, B., Nesson, C.:
Borders in Cyberspace: Information Policy and the Global Information Structure, p. 129.
MIT Press, Massachusetts (1997)
4. Dupont, B.: The Cyber Security Environment to 2022: Trends, Drivers and Implications.
http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2208548
5. Rubinstein, I.: Big data: the end of privacy or a new beginning? Int. Data Priv. Law 3(2), 74
(2013)
6. Kuner, C., Cate, F.H., Millard, C., Svantesson, D.J.B.: The challenge of ‘Big Data’ for data
protection. Int. Data Priv. Law 2(2), 47–49 (2012)
7. Tene, O., Polonetsky, J.: Big data for all: privacy and user control in the age of analytics.
Northwest. J. Technol. Intellect. Property 5, 261 (2013)
8. McKinsey Global Institute: Big Data: The Next Frontier for Innovation, Competition, and
Productivity. http://www.mckinsey.com/insights/business_technology/big_data_the_next_
frontier_for_innovation
9. Telang, R., Wattal, S.: Impact of software vulnerability announcements on the market value
of software vendors-an empirical investigation. IEEE Trans. Softw. Eng. 33, 8 (2007)
10. Moore, T.: The economics of cybersecurity: principles and policy options. Int. J. Crit.
Infrastruct. Prot. 3(3–4), 110 (2010)
11. Lieyang, Q.: It is Essential to clarify the standard of punishment of infringing personal
privacy, PSB Newspaper. http://www.mps.gov.cn/n16/n1237/n1342/n803715/3221182.html
12. European Union: How will the EU’s data protection reform make international cooperation
easier? http://ec.europa.eu/justice/data-protection/document/review2012/factsheets/5_en.pdf
13. Yan, Z., Xinhe, H.: Cyber security risk and its legal regulation. Study Dialectics Nat. 28(10),
62 (2012)
14. THE WHITE HOUSE: Cyberspace Policy Review: Assuring a Trusted and Resilient Information
and Communications Infrastructure. http://www.whitehouse.gov/assets/documents/Cyberspace_
Policy_Review_final.pdf
A Lightweight and Dependable Trust Model
for Clustered Wireless Sensor Networks
Abstract. Resource efficiency and dependability are the most basic requirements for a trust model in any wireless sensor network. However, owing to high cost and low reliability, the existing trust models for wireless sensor networks cannot satisfy these requirements. To address these issues, a lightweight and dependable trust model for clustered wireless sensor networks is proposed in this paper, in which the fuzzy degree of nearness is adopted to evaluate the reliability of the recommended trust values from third-party nodes. Moreover, the definition of a self-adaptive weighting method for trust aggregation at the CH level overcomes the limitations of the subjective definition of weights in traditional weighted methods. Theoretical analysis and simulation results show that, compared with other typical trust models, the proposed scheme requires less memory and communication overhead and has good fault tolerance and robustness, which can effectively guarantee the security of wireless sensor networks.
1 Introduction
existing trust models for clustered WSNs cannot meet the above requirements. There is still a lack of trust models for clustered WSNs that meet the requirements of both resource efficiency and trust reliability.
In order to solve the above problem, a lightweight and dependable trust model for clustered wireless sensor networks is proposed in this paper, in which a lightweight trust decision-making scheme is used. The fuzzy degree of nearness is adopted to evaluate the reliability of the recommended trust values from third-party nodes and to assign them weights based on this reliability, which can greatly weaken the effect of malicious nodes on trust evaluation. In addition, we define an adaptive weighting approach for trust fusion at the cluster head level, which overcomes the limitations of the subjective definition of weights in traditional weighting methods. What is more, it can resist collusion attacks to a certain extent.
The trust model proposed in this paper is built on the clustered wireless sensor network topology for node trust evaluation. Sensor nodes monitor and record the communication behaviors of their neighbor nodes in a test cycle Δt; the monitored events include successful and unsuccessful interactive behaviors of the monitored nodes. If the monitoring result meets the expectation, the monitored event is defined as a successful interaction; otherwise, it is an unsuccessful interaction. In this paper, all sensor nodes in the network are assumed to have a unique identity expressed as a triple <location, node type, node subtype> [11]. According to the different roles of nodes in the clustered WSN, the proposed trust model evaluates trust relationships at two levels: intra-cluster trust evaluation and inter-cluster trust evaluation.
CM-to-CM Direct Trust Calculation. We define the direct trust evaluation method
of CMs within the cluster as follows:
$$T_{i,j}(\Delta t) = \left\lceil \frac{10\,a_{i,j}(\Delta t)}{a_{i,j}(\Delta t) + b_{i,j}(\Delta t)} \left( \frac{1}{\sqrt{b_{i,j}(\Delta t)}} \right) \right\rceil \qquad (1)$$
Here Δt is the time window, whose size can be set according to the specific scenario; thus, with the passage of time, the time window forgets old observations and adds new ones. ⌈·⌉ denotes rounding to the nearest integer, e.g., ⌈4.56⌉ = 5. $a_{i,j}(\Delta t)$ and $b_{i,j}(\Delta t)$ respectively denote the total numbers of successful and unsuccessful interactions of node i with node j during the time Δt. Under special circumstances, if $a_{i,j}(\Delta t) \neq 0$ and $b_{i,j}(\Delta t) = 0$, we set $T_{i,j}(\Delta t) = 10$. If $a_{i,j}(\Delta t) + b_{i,j}(\Delta t) = 0$, which means there were no interactions during the time Δt, we set $T_{i,j}(\Delta t) = \mathcal{R}_{ch,j}(\Delta t)$ (as given in formula (5)), where $\mathcal{R}_{ch,j}(\Delta t)$ is the feedback trust towards node j from the cluster head ch. Thus, a CM evaluates its neighbor nodes' trust values based on $T_{i,j}(\Delta t)$ and $\mathcal{R}_{ch,j}(\Delta t)$.
Given that there is no need to consider feedback trust between CMs, namely the direct trust values towards the evaluated nodes from third parties, the proposed mechanism saves a large amount of network resources. Besides, we can see from formula (1) that, as the number of failed interactions increases, the expression $1/\sqrt{b_{i,j}(\Delta t)}$ rapidly tends to 0, which means that the proposed scheme strictly punishes failed interactions. This strict punishment can effectively prevent sudden attacks from malicious nodes that have accumulated a high trust degree.
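A minimal Python sketch of the direct trust value of Eq. (1), including the special cases described above; the sample interaction counts are illustrative, and `round` stands in for the nearest-integer bracket.

```python
import math

def direct_trust(a, b, feedback_trust=None):
    """CM-to-CM direct trust of Eq. (1).

    a, b: numbers of successful / unsuccessful interactions in the window Δt.
    feedback_trust: R_ch,j(Δt) from the cluster head, used when no interaction
    occurred (special cases as described in the text)."""
    if a + b == 0:
        return feedback_trust            # no interactions: fall back to CH feedback
    if b == 0:
        return 10                        # only successes: maximum trust
    return round((10.0 * a / (a + b)) * (1.0 / math.sqrt(b)))

print(direct_trust(20, 0))   # 10
print(direct_trust(18, 2))   # round(9 * 0.707) = 6 -> strict punishment of failures
print(direct_trust(10, 10))  # round(5 * 0.316) = 2
```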
CH-to-CM Feedback Trust Calculation. We assume that, apart from the cluster head, there are N−1 CMs within a cluster. The cluster head ch periodically broadcasts a trust request packet within the cluster. All CMs then forward their trust values towards the other CMs to ch as a response, and a matrix U is formed to maintain the trust values, as shown below:
$$U = \begin{bmatrix} T_{1,1} & T_{1,2} & \cdots & T_{1,N-1} \\ T_{2,1} & T_{2,2} & \cdots & T_{2,N-1} \\ \vdots & \vdots & \ddots & \vdots \\ T_{N-1,1} & T_{N-1,2} & \cdots & T_{N-1,N-1} \end{bmatrix} \qquad (2)$$
where $T_{i,j}$ ($i \in [1, N-1]$, $j \in [1, N-1]$) represents the direct trust value of node i towards node j. When i = j, $T_{i,j}$ is a node's rating of itself; to avoid self-promotion, the cluster head ch drops this value in the fusion of feedback trusts.
From formula (1), we can see that the values $T_{i,j}$ are the monitoring results of different nodes for the same property (the trust value) of the same node during the same time period. Thus, generally speaking, these data have a great similarity and tend towards a central value.
If a malicious node slanders other nodes, its trust value for the target node will deviate significantly from the normal ones and can be recognized. Based on this idea, the concept of the fuzzy degree of nearness from fuzzy mathematics is introduced. Using the maximum-minimum nearness degree theory [16], we can measure the reliability of each node's trust value towards node j during the same monitoring period.
During a monitoring period, the nearness of the trust values of node k and node l towards node j is defined as follows:
$$r_{k,l} = \min\left(T_{k,j}, T_{l,j}\right) \big/ \max\left(T_{k,j}, T_{l,j}\right) \qquad (3)$$
Furthermore, we obtain the nearness degree matrix S of all nodes with respect to node j, shown as follows:
$$S = \begin{bmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,N-2} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,N-2} \\ \vdots & \vdots & \ddots & \vdots \\ r_{N-2,1} & r_{N-2,2} & \cdots & r_{N-2,N-2} \end{bmatrix} \qquad (4)$$
It can be seen that, for the elements of any row k of the matrix S, if the value of $\sum_{l=1}^{N-2} r_{k,l}$ is relatively large, the trust value of node k towards node j is close to the trust values of the other nodes towards node j, which means a high reliability. On the contrary, if the value of $\sum_{l=1}^{N-2} r_{k,l}$ is small, the trust value of node k towards node j deviates significantly from the central value of the recommended trust values, namely, the reliability of the trust value of node k towards node j is low.
In order to ensure the objectivity and accuracy of trust evaluation, node j's trust value is obtained by fusing the other nodes' trust values towards it within the cluster. Let $\omega_k$ be the weight assigned to each recommendation during the fusion process; then $\mathcal{R}_{ch,j}(\Delta t)$, the fusion of the recommended trust values, is given by
$$\mathcal{R}_{ch,j}(\Delta t) = \sum_{k=1}^{N-2} \omega_k T_{k,j} \qquad (5)$$
where $\omega_k$ satisfies $\sum_{k=1}^{N-2} \omega_k = 1$ ($0 \leq \omega_k \leq 1$). According to the above, $\omega_k$ depends on the reliability of each node's trust value towards node j, that is, on the nearness of the trust values. Thus, it should contain all the information about the nearness between the trust value of node k and those of the other nodes, and is defined as follows:
$$\omega_k = \sum_{l=1}^{N-2} r_{k,l} \Big/ \sum_{k=1}^{N-2} \sum_{l=1}^{N-2} r_{k,l} \qquad (6)$$
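The following Python sketch strings Eqs. (3)–(6) together: it builds the pairwise nearness values, derives the weights from the row sums, and fuses the recommendations into the feedback trust $\mathcal{R}_{ch,j}$. The min/max form of the nearness and the guard for two zero ratings are assumptions consistent with the reconstruction above.

```python
def feedback_trust(trust_values):
    """Fuse the recommendations T_k,j (k = 1..N-2) into R_ch,j per Eqs. (3)-(6).

    trust_values: direct trust values towards node j reported by the other
    cluster members (self-ratings already removed by the cluster head)."""
    n = len(trust_values)
    # Eq. (3)/(4): pairwise fuzzy nearness r_{k,l} = min/max of the two ratings
    # (two zero ratings are treated as fully "near", an assumption).
    nearness = [[min(trust_values[k], trust_values[l]) / max(trust_values[k], trust_values[l])
                 if max(trust_values[k], trust_values[l]) > 0 else 1.0
                 for l in range(n)] for k in range(n)]
    row_sums = [sum(row) for row in nearness]
    total = sum(row_sums)
    # Eq. (6): weight of each recommender is its share of the total nearness.
    weights = [rs / total for rs in row_sums]
    # Eq. (5): weighted fusion of the recommendations.
    return sum(w * t for w, t in zip(weights, trust_values))

# A slandering rating (1) among honest ratings (8, 9, 8) gets a small weight.
print(round(feedback_trust([8, 9, 8, 1]), 2))  # ~7.35
```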
where $\lambda_{i,j}(\Delta t)$ and $\mu_{i,j}(\Delta t)$ respectively denote the total numbers of successful and unsuccessful interactions between CH i and CH j during the time Δt. Under special circumstances, if $\lambda_{i,j}(\Delta t) \neq 0$ and $\mu_{i,j}(\Delta t) = 0$, we set $Y_{i,j}(\Delta t) = 10$.
BS-to-CH Feedback Trust Calculation. We assume there are M CHs in the whole network, which means the whole network is divided into M clusters. The base station bs periodically broadcasts a trust request packet across the whole network. All CHs then forward their trust values towards the other CHs to bs, and a matrix W is formed to maintain the trust values, as shown below:
maintain the trust values, which can be seen as follows:
2 3
Y1;1 Y1;2 Y1;M
6 7
6 Y2;1 Y2;2 Y2;M 7
6 7
W¼6 .. 7 ð8Þ
6 . 7
4 5
YM;1 YM;2 YM;M
Following the same method as in Sect. 2.2, CH j's feedback trust value at the base station bs is as follows:
$$\Im_{bs,j}(\Delta t) = \sum_{k=1}^{M-1} W_k\, Y_{k,j} \qquad (9)$$
$$W = \begin{cases} \omega_l, & Y_{i,j} \leq \theta \\ \omega_h, & Y_{i,j} > \theta \end{cases} \qquad (11)$$
where $0 < \omega_l < 0.5 < \omega_h < 1$ and θ is the default trust threshold. When the trust value of CH i towards CH j is below the threshold θ, which indicates that CH j performs malicious behavior, the value of W is small. That means the trust evaluation mainly depends on the judgment of CH i itself, which prevents other cluster heads from implementing collusion attacks to boost CH j. When the trust value of CH i towards CH j is above the threshold θ, the value of W is large, which indicates that the trust evaluation mainly depends on the other cluster heads' judgments. In this way, we can prevent malicious nodes from rapidly accumulating a high trust value, which would make them difficult to detect.
In summary, at the inter-cluster level, we define the trust decision-making rule as follows: when CH i needs to interact with CH j, it computes $Y_{i,j}(\Delta t)$ based on the interaction history between CH i and CH j; at the same time, i requests the feedback trust from the base station. The two trust sources are then fused to obtain a comprehensive trust value, based on which CH i makes its decision.
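A sketch of the inter-cluster decision rule. Equation (10), which combines the direct trust $Y_{i,j}$ with the base-station feedback, is not reproduced in the text above, so the linear combination below, together with the illustrative values of θ, $\omega_l$, and $\omega_h$, is an assumption consistent with the behaviour described for Eq. (11).

```python
def adaptive_weight(y_ij, theta=5, w_low=0.3, w_high=0.7):
    """Eq. (11): small weight for the feedback when CH i's own rating of CH j is
    at or below the threshold theta, large weight otherwise (theta, w_low, w_high
    are illustrative; the paper only requires 0 < w_low < 0.5 < w_high < 1)."""
    return w_low if y_ij <= theta else w_high

def inter_cluster_trust(y_ij, bs_feedback, theta=5, w_low=0.3, w_high=0.7):
    """Comprehensive inter-cluster trust: a linear combination of the direct
    trust Y_ij and the base-station feedback (an assumed form of Eq. (10))."""
    w = adaptive_weight(y_ij, theta, w_low, w_high)
    return (1 - w) * y_ij + w * bs_feedback

# A suspicious CH (direct trust 3) cannot be "boosted" much by high feedback (9):
print(round(inter_cluster_trust(3, 9), 2))  # 0.7*3 + 0.3*9 = 4.8
# A well-behaved CH (direct trust 8) relies mostly on the wider feedback (7):
print(round(inter_cluster_trust(8, 7), 2))  # 0.3*8 + 0.7*7 = 7.3
```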
malicious nodes with a high trust degree from implementing attacks. This is the first kind of security defense. Secondly, in the calculation of feedback trust, the theory of the fuzzy nearness degree is introduced: based on each node's reliability, we assign weights to the recommendations when fusing the recommendation trust from third parties. That means if malicious nodes launch slandering/boosting attacks, their trust values for the target node will deviate significantly from the normal ones and be recognized; accordingly, a smaller weight is assigned to such recommendations, which reduces the effect of malicious nodes on the target node's trust value and thus effectively resists bad-mouthing attacks. This is the second kind of security defense. In the calculation of trust at the inter-cluster level, we propose a self-adaptive weight factor, which adaptively selects weights according to the specific circumstances of the evaluated nodes. That is to say, when the trust value of CH j is below a preset threshold, the trust value calculation mainly depends on the judgment of CH i itself, which prevents other CHs from implementing collusion attacks to boost the trust value of CH j. When the trust value of CH j is above a preset threshold, the trust value calculation mainly depends on the recommendations from other CHs, which prevents malicious nodes from rapidly accumulating a high trust value, becoming difficult to detect, and finally harming network security. This is the third kind of security defense.
(Figure: the maximum communication overhead (packets) versus the number of clusters.)
a storage cost of $0.5M(N-1)^2$. In the other table, the CH keeps a trust database similar to that of a CM, whose record size is also the same as that of a CM, resulting in a storage cost of 7(M−1). Therefore, the storage cost for each CH is $7(M-1) + 0.5M(N-1)^2$. The storage costs of the various trust models are given in Table 3, where N is the average size of each cluster, M is the number of clusters, Δt is the time window defined in GTMS, and K is the number of trust contexts defined in ATRM.
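For orientation, the CH storage formula quoted above can be evaluated directly; the cluster sizes in the example are illustrative only.

```python
def ch_storage_cost(M, N):
    """Maximum storage cost (bytes) of one cluster head in the proposed scheme,
    as given in the text: 7(M-1) bytes for the inter-cluster trust table plus
    0.5*M*(N-1)^2 bytes for the intra-cluster trust records."""
    return 7 * (M - 1) + 0.5 * M * (N - 1) ** 2

# e.g. 1000 nodes split into M = 100 clusters of average size N = 10
print(ch_storage_cost(M=100, N=10))  # 7*99 + 0.5*100*81 = 4743.0
```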
Figure 3(a) and (b) give a comparison of the storage overheads of four trust models for a large-scale clustered wireless sensor network. From Fig. 3(a), we can see that the storage overhead of the proposed scheme is slightly lower than that of LDTS and obviously lower than those of the other two schemes, which indicates that the scheme needs less storage at the CM level than the other three. It can be seen from Fig. 3(b) that, as the number of clusters increases, the proposed scheme needs less storage than GTMS and ATRM, which means it is more suitable for large-scale wireless sensor networks with a smaller cluster size.
(Fig. 3. The maximum storage overhead (bytes) versus the number of clusters for the proposed scheme, LDTS, GTMS, and ATRM: (a) CM level; (b) CH level.)
In our simulation experiments, there are three kinds of sensor nodes, namely CM, CH, and BS. A CM or a CH can act in two roles, either as a cooperator or as a rater. As a cooperator, a CM can be divided into two categories according to its behavior: good CM (GCM) and bad CM (BCM). A GCM provides successful communication behaviors according to the network protocol, while a BCM causes communication failures. Accordingly, as a rater, a CM can also be classified into two categories according to its behavior: honest CM (HCM) and malicious CM (MCM). An HCM always provides reasonable evaluations of other CMs, while an MCM gives arbitrary evaluations of other CMs between 0 and 10. Similarly, CHs can be divided into GCH, BCH, HCH, and MCH. The simulation scenario is an area of 1000 m × 1000 m in which 1000 sensor nodes (i.e., $m \times n = 1000$) are randomly distributed. We analyze the models under different numbers of clusters.
In order to reflect the reliability of the trust management system, we analyze the
data packet delivery ratio (PSDR). The higher a PSDR is, the higher the reliability is.
We assume in the network, most of the CMs and CHs perform well, where BCM, BCH
respectively comprise 10 %. This assumption is close to the real scene.
[Fig. 4. PDSR versus time-stamp (number of interactions) for the proposed scheme, LDTS, and GTMS under different percentages of MCHs: (a) 5 %; (b) 10 %; (c) 20 %]
Figure 4 shows a comparison of the PDSR of the three trust models under different percentages of MCHs. In the experiments, we assume that 95 % of the CMs are honest and the remaining 5 % are malicious. We then set the percentage of MCHs to 5 %, 10 %, and 20 %, representing an honest sensor network environment, a relatively honest one, and a dishonest one, respectively. From (a), we can see that under the honest network environment, the PDSRs of the three schemes are all very high. However, compared with (a), panels (b) and (c) show big differences. As the percentage of MCHs increases, the performance of all three schemes decreases significantly. Relatively speaking, the performance of the proposed scheme is stable and significantly better than that of the other two schemes, with strong robustness. This is because, in the presence of malicious attacks, the proposed scheme uses the concept of fuzzy nearness degree for recommendation trust fusion, which improves the accuracy of trust calculation. At the same time, in order to reduce the risk of trust evaluation, we define a self-adaptive weight factor for the fusion of direct trust and feedback trust, which ensures the objectivity of the trust and contributes to the effective recognition of malicious nodes, thus improving the reliability and security of the whole network.
[Fig. 5. PDSR versus time-stamp (number of interactions) for the proposed scheme, LDTS, and GTMS under different percentages of MCMs]
Figure 5 gives a comparison of the PDSR of the three trust models under different percentages of MCMs. It can be seen from the figure that, compared with the other two schemes, our scheme has a higher reliability. Panel (a) gives the simulation results in an honest network environment with 10 % MCMs as well as 10 % MCHs. We can see that the performances of the three schemes are relatively stable. Panels (b) and (c) show the experimental results under the relatively honest and dishonest network environments, from which we can see that, as the percentage of MCMs increases, the performance of all three schemes declines to different degrees. But the proposed scheme still outperforms LDTS and GTMS, indicating that the reliability of the proposed scheme is higher and that it is more applicable to clustered WSNs.
5 Conclusions
Resource efficiency and dependability are basic requirements for a trust model in any wireless sensor network. However, owing to high cost and low reliability, the existing trust models for wireless sensor networks cannot satisfy these requirements. In view of the above problems, a lightweight and dependable trust model is proposed in this paper. By introducing the nearness degree from fuzzy theory, we measure the reliability of third-party recommendation trust to improve the accuracy and objectivity of trust calculation, which helps detect malicious nodes. Moreover, the definition of a self-adaptive weighting method for trust aggregation at the CH level overcomes the limitations of the subjective definition of weights in traditional weighted methods. Theoretical analysis and simulation results show that, compared with other classical WSN trust models, the proposed scheme requires less memory and communication overhead. Besides, it can effectively resist the garnished attack, bad-mouthing attack, and collusion attack. With a high reliability, the proposed model can effectively guarantee the security and normal operation of the whole network.
References
1. Khalid, O., Khan, S.U., Madani, S.A., et al.: Comparative study of trust and reputation
systems for wireless sensor networks. Secur. Commun. Netw. 6, 669–688 (2013)
2. Kant, K.: Systematic design of trust management systems for wireless sensor networks: a
review. In: 4th IEEE International Conference on Advanced Computing and Communication
Technologies (ACCT), pp. 208–215 (2014)
3. Ishmanov, F., Malik, A.S., Kim, S.W., et al.: Trust management system in wireless sensor
networks: design considerations and research challenges. Trans. Emerg. Telecommun.
Technol. 26, 107–130 (2015)
4. Ganeriwal, S., Balzano, L.K., Srivastava, M.B.: Reputation-based framework for high
integrity sensor networks. ACM Trans. Sensor Netw. (TOSN) 4, 15 (2008)
5. Yao, L., Wang, D., Liang, X., et al.: Research on multi-level fuzzy trust model for wireless
sensor networks. Chin. J. Sci. Instru. 35, 1606–1613 (2014)
6. Duan, J., Gao, D., Yang, D., et al.: An energy-aware trust derivation scheme with game
theoretic approach in wireless sensor networks for IoT applications. IEEE Internet
Things J. 1, 58–69 (2014)
7. Zhang, M., Xu, C., Guan, J., et al.: A novel bio-inspired trusted routing protocol for mobile
wireless sensor networks. KSII Trans. Internet Inf. Syst. (TIIS). 8, 74–90 (2014)
8. Boukerche, A., Xu, L., EL-Khatib, K.: Trust-based Security for wireless Ad Hoc and sensor
networks. Comput. Commun. 30, 2413–2427 (2007)
9. Shaikh, R.A., Jameel, H., d’Auriol, B.J., et al.: Group-based trust management scheme for
clustered wireless sensor networks. IEEE Trans. Parallel Distrib. Syst. 20, 1698–1712
(2009)
10. Bao, F., Chen, R., Chang, M.J., et al.: Hierarchical trust management for wireless sensor networks and its applications to trust-based routing and intrusion detection. IEEE Trans. Netw. Serv. Manage. 9, 169–183 (2012)
11. Li, X., Zhou, F., Du, J.: LDTS: a lightweight and dependable trust system for clustered
wireless sensor networks. IEEE Trans. Inf. Forensics Secur. 8, 924–935 (2013)
12. Younis, O., Fahmy, S.: HEED: a hybrid, energy-efficient, distributed clustering approach for
ad-hoc sensor networks. IEEE Trans. Mob. Comput. 3, 366–379 (2004)
13. Wei, D., Jin, Y., Vural, S., et al.: An energy-efficient clustering solution for wireless sensor
networks. IEEE Trans. Wireless Commun. 10, 3973–3983 (2011)
14. Javaid, N., Qureshi, T.N., Khan, A.H., et al.: EDDEEC: enhanced developed distributed
energy-efficient clustering for heterogeneous wireless sensor networks. Procedia Comput.
Sci. 19, 914–919 (2013)
15. Han, G., Jiang, J., Shu, L., et al.: Management and applications of trust in wireless sensor
networks: a survey. J. Comput. Syst. Sci. 80, 602–617 (2014)
16. Yang, J., Gong, F.: Consistent and reliable fusion method of multi-sensor based on degree of
nearness. Chin. J. Sens. Actuators 23, 984–988 (2010)
The Optimization Model of Trust
for White-Washing
Nowadays, more and more services rely on the network. Because of the openness and anonymity of the network, there are often huge risks behind these services. Based on research on Twitter users' behavior, Bilge [1] pointed out that users will unconsciously click messages sent by attackers once the attackers have gained their trust, which brings losses to the users. Jagatic [2] showed that phishing has a higher success rate when it exploits stolen user-specific information. With the enormous increase of social network users, many kinds of malicious behaviors and related malicious nodes have appeared, which consume network resources, damage the interests of others, and endanger the safety of the public network. There are different kinds of trust models [3–8] that can help users select desirable services in huge network systems, but they usually neglect white-washing behavior. When a malicious node's trust in the system is low, the node can withdraw and re-enter the system with a new identity. In this way, the node obtains a higher initial trust and is thus able to continue its malicious behavior.
The prime factors for white-washing are as follows: (a) The cost of new user registration is low, so it is easy for a user to get a new identity. (b) A new user often has a higher initial trust. (c) Most trust models do not consider a node's history dwell time (the time from the user's registration to the present) in the system. As the network is open, it is impossible to increase the cost of user registration. So, in order to resist white-washing, dwell time and an appropriate initial trust must be taken into account.
Huang [9] has taken dwell time into consideration, splitting all the services in the same domain into a mature service queue and a novice service queue so that new services only compete with new ones until they grow mature. However, this is unfair to the mature services at the back of the mature queue, because they may be better than the top services in the novice queue. So a better way to select services between the two queues is needed. Some works bind the presenter and presentee together with a feedback method [10, 11], but they dampen the enthusiasm of nodes to join the system, because new nodes do not have a large number of interactions and, as a result, can neither recommend nor be recommended effectively. This does not conform to the demands of the network. Barra [12] resists the entry of malicious nodes by increasing the system sensitivity, i.e., when a malicious node in the system is found, the system raises its standards for a period of time, thus shielding against malicious nodes. But this method is appropriate for live streaming systems rather than the general network.
In this paper, all nodes in the system are divided into two groups: a safe group and a dubious group, and nodes in different groups have different rights: safe nodes can recommend directly, while dubious nodes can recommend only when a safe node provides a guarantee for them. The major contributions of this paper can be summarized as follows:
– A trustworthy recommendation model is proposed, which contains a grouping process and a guarantee process.
– Experiments show that the model can resist white-washing attacks effectively while ensuring the normal interaction activities of the other nodes.
The remainder of this paper is organized as follows. Section 2 presents the trustworthy recommendation model. Section 3 shows the experimental results. Section 4 concludes the paper.
$$GT_x = \frac{1}{n}\sum_{i=1}^{n} DT(i,x) \qquad (1)$$
The model proposed in this paper is made up of four parts: node grouping, TOPSIS-based node guarantee, trust attenuation, and trust updating. To describe the model better, we define the following tuples:
Resource Demand Application (RDA) = (Demander_ID, ServiceContents, TimeStamp)
Here, Demander_ID refers to the node that requests the resource; ServiceContents refers to the requested resource; TimeStamp refers to the time of sending the RDA.
Resource Recommend Message (RRM) = (Recommend_ID, ServiceContents, Response_Time, Recommend_Oneself_bool, Guarantor_ID)
Here, Recommend_ID refers to the node that provides the recommendation; ServiceContents refers to the recommended resource; Response_Time refers to the time of sending the RRM; Recommend_Oneself_bool indicates whether the node is an SR or an OR; Guarantor_ID refers to the node that makes the guarantee.
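For readability, the two message tuples can be written down as simple record types. The sketch below is only a literal transcription of the fields listed above, with our own field types; it is not part of the authors' implementation.

from dataclasses import dataclass

@dataclass
class RDA:
    """Resource Demand Application."""
    demander_id: int        # node that requests the resource
    service_contents: str   # requested resource
    timestamp: float        # time of sending the RDA

@dataclass
class RRM:
    """Resource Recommend Message."""
    recommend_id: int             # node that provides the recommendation
    service_contents: str         # recommended resource
    response_time: float          # time of sending the RRM
    recommend_oneself_bool: bool  # whether the node recommends itself (SR) or another (OR)
    guarantor_id: int             # node that makes the guarantee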
Figure 1 shows the main network structure.
Here, $SD_y$ refers to the variance of the group of nodes that have direct interaction with y, and $SD_{yx}$ refers to the variance of the group of nodes that have direct interaction with y, but without x.
Then the departure degree of x can be expressed as:
$$DD_x = 1 - \frac{1}{n}\sum_{i=1}^{n} DD_{xi} \qquad (3)$$
Here, $DD_x$ is defined so that it trends consistently with the data below, i.e., the smaller $DD_x$ is, the lower the departure degree of x is.
When a node has white-washed, it has a higher initial trust value, a low dwell time, and a null trading record. So we use dwell time, number of tradings, global trust, and departure degree to divide the nodes into two groups: the safe group and the dubious group.
We define the tuple ⟨T, N, GT, DD⟩, where T, N, GT, and DD refer to dwell time, number of tradings, global trust, and departure degree in turn, and $T_h, N_h, GT_h, DD_h$ refer to the critical values of the four indexes above. Then the grouping rule is:
Only when $T_i \geq T_h$, $N_i \geq N_h$, $GT_i \geq GT_h$, and $DD_i \geq DD_h$ can a node be assigned to the safe group; otherwise, it is assigned to the dubious group.
We define a critical coefficient c ($0 \leq c \leq 1$), where c means that the last fraction c of nodes is eliminated after sorting, i.e., if the critical coefficient of dwell time is 0.3, the last 30 % of nodes are eliminated after sorting by dwell time, and the critical dwell time is the minimum value among the remaining 70 % of nodes. The higher the critical coefficient is, the stricter the grouping conditions are, and the more credible the nodes in the safe group are. When the critical coefficient is 0, the model degrades into a traditional model.
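A minimal sketch of this grouping rule is given below. It assumes that every index is compared against its critical value in the same "at least the threshold" direction, and that the critical value is the minimum of the nodes kept after dropping the last fraction c; index names and node representation are ours.

def critical_value(values, c):
    """Critical value of one index: sort descending, drop the last fraction c
    of nodes, and take the minimum of the remaining ones."""
    ordered = sorted(values, reverse=True)
    kept = ordered[:max(1, int(len(ordered) * (1 - c)))]
    return min(kept)

def group_nodes(nodes, c=0.3):
    """Split nodes (dicts with indexes T, N, GT, DD) into safe and dubious groups."""
    thresholds = {k: critical_value([n[k] for n in nodes], c)
                  for k in ("T", "N", "GT", "DD")}
    safe, dubious = [], []
    for n in nodes:
        # A node is safe only if every index reaches its critical value
        # (an assumed reading of the rule above).
        if all(n[k] >= thresholds[k] for k in thresholds):
            safe.append(n)
        else:
            dubious.append(n)
    return safe, dubious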
the guarantee of safe nodes. Here, the system evaluates each index with the information entropy, and then selects the recommended resource with TOPSIS.
Information Entropy Weight of Evaluation Index. Information entropy is a measurement of uncertainty in information theory: the more information an index carries and transmits, the smaller the remaining uncertainty and residual entropy are, i.e., the more useful this index is for decision-making. Here, the information entropy weight is used to reflect the importance of each index in decision-making.
Suppose there are m nodes, each with n assessment indexes, forming the evaluation matrix $X = (x_{ij})_{m \times n}$, where $x_{ij}$ represents the jth evaluation index of the ith node. Since each index has a different dimension, the data are first standardized:
$$sx_{ij} = \frac{x_{ij}}{\sqrt{\sum_{i=1}^{m} x_{ij}^2}}, \quad i = 1,2,\ldots,m;\ j = 1,2,\ldots,n \qquad (4)$$
$$e_j = -\frac{1}{\ln m}\sum_{i=1}^{m} sx_{ij}\ln sx_{ij}, \quad j = 1,2,\ldots,n;\quad 0 \leq e_j \leq 1 \qquad (5)$$
As the information entropy of an evaluation index gets higher, the weight of that index should be smaller. The information entropy weight of each index is therefore:
$$\omega_j = \frac{1 - e_j}{n - \sum_{j=1}^{n} e_j}, \quad j = 1,2,\ldots,n \qquad (6)$$
Only if $sx_{1j} = sx_{2j} = \cdots = sx_{mj}$ is satisfied can $e_j$ reach its maximum value; meanwhile $\omega_j = 0$, that is to say, the jth index provides no useful information.
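The short sketch below is a direct transcription of Eqs. (4)–(6) as reconstructed above; the example matrix and the small constant used to avoid log(0) are our own illustrative choices.

import numpy as np

def entropy_weights(X):
    """Information-entropy weights for an m-by-n evaluation matrix X,
    following Eqs. (4)-(6): column-wise normalization, entropy per index,
    then weight proportional to (1 - e_j)."""
    X = np.asarray(X, dtype=float)
    m, n = X.shape
    S = X / np.sqrt((X ** 2).sum(axis=0))            # Eq. (4)
    P = np.where(S > 0, S, 1e-12)                    # avoid log(0)
    e = -(P * np.log(P)).sum(axis=0) / np.log(m)     # Eq. (5)
    return (1 - e) / (n - e.sum())                   # Eq. (6)

# Three nodes evaluated on two indexes (illustrative numbers).
print(entropy_weights([[3.0, 0.2], [2.5, 0.9], [2.8, 0.4]]))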
Node Selection with TOPSIS. After the nodes in need send RDAs into the system, there are two cases:
Direct Recommendation of Safe Nodes. In this case, there is no limit on the recommendation: safe nodes can provide their SRs to the requesting node directly. The node in need then calculates the recommendation trust value from the direct/indirect trust and the global trust as follows:
$$RT(x,y) = \begin{cases} \alpha\, DT(x,y) + (1-\alpha)\, GT_y & \text{if $x$ has direct interaction with $y$} \\ \beta\, IT(x,y) + (1-\beta)\, GT_y & \text{if $x$ has no direct interaction with $y$} \end{cases} \qquad (7)$$
When a safe node is ready to provide a guarantee, it may receive requests from more than one dubious node, so there should be a method to help safe nodes select dubious nodes. Here, we use TOPSIS, which is short for Technique for Order Preference by Similarity to an Ideal Solution. The idea of TOPSIS is to sort the nodes according to their closeness to the ideal target: the node closest to the best solution and farthest away from the worst solution is the optimal node; otherwise, it is the worst. In this paper, TOPSIS is simplified, and only profitability indexes are taken into account.
Suppose there are m nodes, each with n assessment indexes, forming the evaluation matrix $X = (x_{ij})_{m \times n}$; then the weighted decision matrix is $Y = (y_{ij})_{m \times n}$, where $y_{ij} = \omega_j \cdot sx_{ij}$ $(i = 1,2,\ldots,m;\ j = 1,2,\ldots,n)$.
The set of best values $Y^+$ and the set of worst values $Y^-$ are:
$$Y^+ = \{y_1^+, y_2^+, \ldots, y_n^+\} = \left\{\max_i y_{ij} \mid j \in J_e\right\}, \qquad Y^- = \{y_1^-, y_2^-, \ldots, y_n^-\} = \left\{\min_i y_{ij} \mid j \in J_e\right\} \qquad (8)$$
where $J_e$ refers to the profitability indexes. Then the distances between the evaluation values and the best/worst values are:
$$df_i^+ = \sqrt{\sum_{j=1}^{n}\left(y_{ij} - y_j^+\right)^2}, \quad i = 1,2,\ldots,m \qquad (9)$$
$$df_i^- = \sqrt{\sum_{j=1}^{n}\left(y_{ij} - y_j^-\right)^2}, \quad i = 1,2,\ldots,m \qquad (10)$$
Then, the closeness degree of a dubious node to the ideal node is:
$$cd_i = \frac{df_i^-}{df_i^+ + df_i^-}, \quad i = 1,2,\ldots,m \qquad (11)$$
Apparently, $0 < cd_i \leq 1$, and the bigger $cd_i$ is, the closer the dubious node is to the ideal node.
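A minimal sketch of this simplified TOPSIS step, following Eqs. (8)–(11) as reconstructed above, is shown below; the normalized matrix, the weights, and the example numbers are illustrative assumptions.

import numpy as np

def topsis_closeness(S, weights):
    """Closeness of each node to the ideal node over a normalized m-by-n
    matrix S of profitability indexes, weighted by entropy weights."""
    Y = S * weights                                      # weighted decision matrix
    y_best, y_worst = Y.max(axis=0), Y.min(axis=0)       # Eq. (8)
    d_best = np.sqrt(((Y - y_best) ** 2).sum(axis=1))    # Eq. (9)
    d_worst = np.sqrt(((Y - y_worst) ** 2).sum(axis=1))  # Eq. (10)
    return d_worst / (d_best + d_worst)                  # Eq. (11)

# A safe node ranks three dubious requesters by closeness to the ideal node.
S = np.array([[0.5, 0.7], [0.9, 0.4], [0.6, 0.6]])
print(topsis_closeness(S, weights=np.array([0.6, 0.4])))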
Then the node in need calculates the recommendation trust value among the safe node, the dubious node, and itself, similarly to the direct recommendation of safe nodes:
$$RT(x,y) = \begin{cases} \alpha_1\, DT(x,k) + (1-\alpha_1)\left(\delta\, GT_y + (1-\delta)\, GT_k\right) & \text{if $x$ has direct interaction with $k$} \\ \alpha_2\, IT(x,k) + (1-\alpha_2)\left(\delta\, GT_y + (1-\delta)\, GT_k\right) & \text{if $x$ has no direct interaction with $k$} \end{cases} \qquad (12)$$
where x is the node in need, y is the dubious node, k is the safe node, and $\alpha_1, \alpha_2, \delta$ are regulatory factors.
$$f(N(t)) = P\{N(t+s) - N(s) = k\} = \frac{(\lambda t)^k}{k!}e^{-\lambda t}, \quad k = 0,1,2,\ldots \qquad (13)$$
where k is the trading number and $\lambda$ is the trading number per unit time.
So, at time t, the direct trust of x to y is attenuated as:
$$DT(x,y) = \frac{(\lambda t)^k}{k!}e^{-\lambda t} \cdot DT(x,y) \qquad (14)$$
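A literal reading of Eqs. (13)–(14) as reconstructed above is sketched below; the parameter values are only illustrative.

from math import exp, factorial

def attenuated_trust(dt, lam, t, k):
    """Attenuate a direct trust value DT(x, y) by the Poisson factor
    (lambda*t)^k / k! * exp(-lambda*t) of Eqs. (13)-(14).

    lam : trading number per unit time (lambda)
    t   : elapsed time
    k   : trading number
    """
    factor = (lam * t) ** k / factorial(k) * exp(-lam * t)
    return factor * dt

# A node that traded rarely over a long idle period keeps little of its trust.
print(attenuated_trust(dt=0.8, lam=2.0, t=3.0, k=1))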
Here, $\omega_1, \omega_2, \omega_3$ are regulatory factors of the feedback, $DT(x,y)_n$ refers to the trust value after updating, and $DT(x,y)_{n-1}$ refers to the trust value before updating.
As the number of users is enormous, frequently updating the groups is needless and impractical. So an update rate is set here: only when the recently added interactions in the system reach this rate are all the system parameters, including the grouping, updated.
The experiments simulate the trust recommendation model and verify the model's ability to resist white-washing. Meanwhile, through comparison with the PeerTrust model [14], it is verified that this paper's model can resist malicious nodes well (Table 1).
The experiment consists of several trading cycles, and every safe node has one interaction in each cycle. The experiments are based on MATLAB 2014b, running on a machine with a 2.6 GHz CPU and 8 GB memory.
Experiment 1. The Analysis of White-washing Resistance. Existing white-washing-resistant models are usually applied to resource sharing systems, which use the number of available resources to measure the effectiveness of the models. However, our model is used in a recommendation system, whose effectiveness can only be measured by the interval between the recommendations made by malicious nodes before and after white-washing, since the change of trust alone cannot decide whether a recommendation is accepted. So the experiment, which shows the trading cycles with and without our model, does not compare our model against other white-washing models, but against the traditional recommendation trust model.
With our model, whether the malicious node white-washes or not, it can only stay in the dubious group, which cannot recommend resources to others directly, from beginning to end. When a white-washed node re-enters the system, it has the lowest dwell time among all nodes in the system. So if the white-washed node wants to be transferred into the safe group, it will cost more time than before.
The figure below shows the average number of trading cycles malicious nodes need, with and without our model, to be transferred into the safe group, here with c = 0.3. Without our model, a node needs about 25 trading cycles to enter the safe group, but it needs about 50 trading cycles with our model. Obviously, with our model the malicious nodes need more time to be transferred from the dubious group into the safe group (Fig. 3).
[Figure: Number of Trading Cycles versus Rate of Malicious Nodes, with and without our model]
Fig. 3. Average cost of trading cycles with and without our model
In order to ensure the positivity of common nodes, this model gives a high initial global trust value to new nodes. So a white-washed node's closeness degree may be higher than before. However, as the white-washed node's dwell time is the lowest and its transaction record is null, its closeness degree may still not be high enough for it to be selected by safe nodes, even though it is higher than before. Thus white-washed nodes cannot profit more under this model, and the behavior of white-washing becomes meaningless.
Experiment 2. The Analysis of Malicious Behavior Resistance. In this experiment, we select four kinds of nodes from the system: one malicious node with a high global trust value, one malicious node with a low global trust value, one normal node with a high global trust value, and one normal node with a low global trust value. Figure 4 shows the change of these four nodes' global trust over 40 trading cycles. As the malicious nodes perform malicious behaviors in the system, their global trust drops rapidly and reaches a stable level of about 0.2. On the contrary, the normal nodes' global trust goes up steadily and reaches a stable level of about 0.8. Above all, this model can distinguish malicious nodes from normal nodes well (Fig. 4).
[Figure: Global Trust versus Trading Cycle for NormalNode1, NormalNode2, MaliciousNode1, and MaliciousNode2]
Fig. 4. The change of nodes along with the trading cycle increase
Experiment 3. The Analysis of Malicious Nodes in Different Scales. In this experiment, the critical coefficients of the indexes are set to 0.3, 0.5, and 0.7 respectively, corresponding to three different system strategies: optimistic, ordinary, and negative, and we compare them with the PeerTrust model. The rate of successful interaction in our model is always higher than that of the PeerTrust model; especially when c = 0.7, the rate is much higher than the others.
[Figure: Rate of Successful Interaction versus Rate of Malicious Nodes for PeerTrust and for the optimistic, ordinary, and negative strategies]
Fig. 5. The rate of successful interaction when the rate of malicious nodes is changing
However, the number of safe nodes is reduced enormously under the negative system strategy, which decreases the positivity of nodes in the system. So we only suggest using the negative system strategy when there are too many malicious nodes (Fig. 5).
4 Conclusion
Acknowledgement. This work is partly supported by the Fundamental Research Funds for the Central Universities (No. NZ2015108), the China Postdoctoral Science Foundation funded project (2015M571752), the Jiangsu Planned Projects for Postdoctoral Research Funds (1402033C), and the Open Project Foundation of the Information Technology Research Base of the Civil Aviation Administration of China (No. CAAC-ITRB-201405).
References
1. Bilge, L., Strufe, T., Balzarotti, D., Kirda, E.: All your contacts are belong to us: automated identity theft attacks on social networks. In: Proceedings of the 18th International Conference on World Wide Web, pp. 551–560. ACM (2009)
2. Jagatic, T.N., Johnson, N.A., Jakobsson, M., Menczer, F.: Social phishing. Commun. ACM
50, 94–100 (2007)
3. Wang, X., Liu, L., Su, J.: Rlm: a general model for trust representation and aggregation.
IEEE Trans. Serv. Comput. 5, 131–143 (2012)
4. Malik, Z., Akbar, I., Bouguettaya, A.: Web services reputation assessment using a hidden
markov model. In: Baresi, L., Chi, C.-H., Suzuki, J. (eds.) ICSOC-ServiceWave 2009.
LNCS, vol. 5900, pp. 576–591. Springer, Heidelberg (2009)
5. Yahyaoui, H.: A trust-based game theoretical model for Web services collaboration. Knowl.-
Based Syst. 27, 162–169 (2012)
6. Huang, K., Yao, J., Fan, Y., Tan, W., Nepal, S., Ni, Y., Chen, S.: Mirror, mirror, on the web,
which is the most reputable service of them all? In: Basu, S., Pautasso, C., Zhang, L., Fu, X.
(eds.) ICSOC 2013. LNCS, vol. 8274, pp. 343–357. Springer, Heidelberg (2013)
7. Malik, Z., Bouguettaya, A.: Rateweb: reputation assessment for trust establishment among
web services. VLDB J. Int. J. Very Large Data Bases 18, 885–911 (2009)
8. Yahyaoui, H., Zhioua, S.: Bootstrapping trust of Web services based on trust patterns and
Hidden Markov Models. Knowl. Inf. Syst. 37, 389–416 (2013)
9. Huang, K., Liu, Y., Nepal, S., Fan, Y., Chen, S., Tan, W.: A novel equitable trustworthy
mechanism for service recommendation in the evolving service ecosystem. In: Franch, X.,
Ghose, A.K., Lewis, G.A., Bhiri, S. (eds.) ICSOC 2014. LNCS, vol. 8831, pp. 510–517.
Springer, Heidelberg (2014)
10. Kudtarkar, A.M., Umamaheswari, S.: Avoiding white washing in P2P networks. In: First
International Communication Systems and Networks and Workshops, 2009, COMSNETS
2009, pp. 1–4. IEEE (2009)
11. Chen, J., Lu, H., Bruda, S.D.: Analysis of feedbacks and ratings on trust merit for
peer-to-peer systems. In: International Conference on E-Business and Information System
Security, 2009, EBISS’09, pp. 1–5. IEEE (2009)
12. de Almeida, R.B., Natif, M., Augusto, J., da Silva, A.P.B., Vieira, A.B.: Pollution and
whitewashing attacks in a P2P live streaming system: analysis and counter-attack. In: 2013
IEEE International Conference on Communications (ICC), pp. 2006–2010. IEEE (2013)
13. Wang, G., Gui, X.-L.: Selecting and trust computing for transaction nodes in online social
networks. Jisuanji Xuebao (Chin. J. Comput.) 36, 368–383 (2013)
14. Xiong, L., Liu, L.: Peertrust: supporting reputation-based trust for peer-to-peer electronic
communities. IEEE Trans. Knowl. Data Eng. 16, 843–857 (2004)
Malware Clustering Based on SNN Density
Using System Calls
1 Introduction
The number of malware samples worldwide continues to grow exponentially. The number of samples stored in the sample library is gradually approaching one hundred million, and the average number of new samples per day is about 200,000. Malware is still the main threat to information security. Because resources are limited, using them effectively against this huge volume of malware requires finding new families worth further study, which is one of the most important tasks of automatic analysis. To achieve this goal, malware clustering is indispensable. The effectiveness of clustering is closely related to whether new families can be found quickly. In past academic research and practical applications, the clustering algorithms used for malware were mainly simple ones such as k-means. In practice, however, family sizes are extremely imbalanced. For example, the number of Trojans named Zbot reached 350,000 in Antiy's sample library, while Flame has only 57 samples. Finding new families composed of very small numbers of samples in a large database through clustering is thus a challenge for traditional malware clustering algorithms.
The work was supported by the National Natural Science Foundation of China (Grant No. 61472437).
2 Related Work
Academia has conducted extensive research on malware clustering. Horng-Tzer Wang [1] provides a system based on structural sequence comparison and probabilistic similarity measures for detecting variants of malware. The system uses system call sequences generated in a sandbox, represents them by Markov chains, and uses the classical k-means clustering algorithm. Xin Hu [2] proposes a system called MutantX-S, which extracts opcodes from the memory image of the malware through an unpacking process and converts the high-dimensional feature vectors to a low-dimensional space. The system adopts a prototype-based, close-to-linear clustering algorithm. Orestis Kostakis [3] uses the graph edit distance (GED) as the basis for measuring the similarity of graphs. They use a simulated annealing algorithm (SA) and a lower bound of the graph edit distance to improve the accuracy and efficiency of calculating the similarity between the call graphs of the entire system. Battista Biggio [4] studied how to poison behavioral malware clustering; they explained the process of poisoning single-linkage hierarchical clustering, using bridge-based attacks that repeatedly bridge two neighboring families to make the algorithm fail. Yanfang Ye [5] proposes a system called AMCS, which uses opcode frequencies and opcode sequences as features. The system uses a hybrid of hierarchical clustering and k-medoids clustering, and it has been used commercially. Roberto Perdisci [6] and others establish the system VAMO, an evaluation system proposed to assess the effectiveness of clustering algorithms. Bayer U [7] converts the dynamic analysis of malware into behavior records and uses LSH to avoid calculating distances between samples below a threshold; they cluster using the results of a hierarchical clustering over LSH. Kazuki Iwamoto [8] proposes an automatic malware classification method, which extracts features through static analysis of the malware and the structure of its code; in that paper an API sequence graph is created. Guanhua Yan [9] explores the efficiency of several classification methods commonly used with different feature vectors. Silvio Cesare [10, 11] designs the Malwise system, which uses information entropy analysis to complete the unpacking of malware. It can automatically extract the control flows of samples, generate k-subgraph features and q-gram features to complete pre-filtering, and then use exact and approximate matching algorithms for malware. Shi, Hongbo [12] uses the frequencies of called DLLs as feature vectors and uses a GHSOM (growing hierarchical self-organizing map) neural network for malware. Jae wook-Jang [13] obtains a social network of system calls: they generate complete system call graphs by recording the behaviors of the system calls of samples and the social networks between calls.
Density-based clustering algorithms have rarely been used in earlier studies of malware clustering. In the field of data mining, however, density-based clustering algorithms work well for high-dimensional data. Therefore, we use density-based clustering of malware for further research. Moreover, earlier studies used clustering algorithms with different features, and we believe that using system calls as features works well. So we choose system calls as the features for our clustering research.
3 Methodology
3.1 Feature Extraction
Packing is an important factor that interferes with the analysis of malware, and it has an especially large influence on the results of static analysis. In order to reduce the interference caused by packing, we extract the system calls of malware from dynamic analysis as features. For feature extraction, we obtain the system calls invoked in the process of execution. However, different malware from different families may call the same system calls during execution. To address this problem, this paper uses the frequencies of the system calls invoked by the malware.
In the process of feature extraction, we first run the malware to obtain its sequence of system calls. All the system calls observed are stored in the database, and each system call is given a unique ID that determines its position in the feature vectors. Then, for each malware sample, we count how many times each system call is invoked and place these frequencies at the corresponding positions of the feature vector. Finally, a fixed-length feature vector is generated for each malware sample. According to our preliminary analysis, the system calls frequently used by malware are only a small part of all system calls, so the feature vectors generated by this method will not be too long. Figure 1 depicts the process of feature extraction.
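A minimal sketch of this feature construction is shown below; the call names, IDs, and trace are our own illustrative examples, while in the paper the IDs come from the database.

from collections import Counter

def build_feature_vector(call_sequence, call_ids):
    """Turn one sample's system-call sequence into a fixed-length frequency
    vector; call_ids maps each system call name to its position (unique ID)."""
    vector = [0] * len(call_ids)
    for name, count in Counter(call_sequence).items():
        if name in call_ids:
            vector[call_ids[name]] = count
    return vector

call_ids = {"NtCreateFile": 0, "RegOpenKeyExA": 1, "NtWriteFile": 2}
trace = ["RegOpenKeyExA"] * 3 + ["NtCreateFile"]
print(build_feature_vector(trace, call_ids))   # [1, 3, 0]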
For example, given two vectors $\vec{x} = (x_1, x_2, \ldots, x_n)$ and $\vec{y} = (y_1, y_2, \ldots, y_n)$ in n dimensions, the Euclidean distance is:
$$distance = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2} \qquad (1)$$
After the Euclidean distance calculation, the distances between samples are relatively large, so the sample points are loosely distributed. In order to reduce the errors caused by this looseness, we use the Shared Nearest Neighbor (SNN) approach to define and quantify the similarities between samples. To further calculate the SNN similarities of the malware, we need to calculate the kNN of each malware sample. The definition of kNN is given in Definition 1. In simple terms, sample p's kNN contains the k samples whose feature vectors are nearest to the feature vector of sample p. Figure 2 shows the 1NN, 4NN, and 5NN of a sample p.
Definition 1. k nearest neighbors (kNN): let p be a sample of the data set D, and define the distance between p and its kth nearest neighbor as k-distance(p). The kNN of p contains all samples whose distance to p is not greater than k-distance(p):
$$kNN(p) = \{\, q \in D \setminus \{p\} \mid dist(p,q) \leq k\text{-}distance(p) \,\}$$
In essence, as long as two samples are in each other's nearest neighbor lists, their SNN similarity is the number of their shared neighbors. Its definition is given in Definition 2.
Definition 2. SNN similarity: let p and q be two samples, and let NN(p) and NN(q) denote the sets of nearest neighbors of p and q. The degree of similarity between them is calculated as:
$$similarity(p,q) = \begin{cases} |NN(p) \cap NN(q)| & \text{if } p \in NN(q) \text{ and } q \in NN(p) \\ 0 & \text{otherwise} \end{cases}$$
Because SNN similarity reflects the local structure of points in the sample space, it is relatively insensitive to changes of density and of spatial dimension. The definition of SNN density is given in Definition 3. SNN density measures the extent to which a sample is surrounded by similar points.
Definition 3. SNN density: sample p's SNN density is the number of samples in the data set D whose SNN similarity to p reaches a given similarity threshold SIM:
$$density(p) = |\{\, q \in D \mid similarity(p,q) \geq SIM \,\}|$$
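The three definitions can be sketched directly in Python as below; the data layout (a pairwise distance dictionary and precomputed kNN sets) is our assumption, not the paper's data structures.

def knn(dist, p, k):
    """k nearest neighbors of sample p given a pairwise distance dict (Def. 1)."""
    others = sorted((q for q in dist[p] if q != p), key=lambda q: dist[p][q])
    return set(others[:k])

def snn_similarity(nn, p, q):
    """Number of shared neighbors if p and q are in each other's kNN lists,
    otherwise 0 (Def. 2)."""
    if p in nn[q] and q in nn[p]:
        return len(nn[p] & nn[q])
    return 0

def snn_density(nn, p, samples, sim_threshold):
    """Number of samples whose SNN similarity with p reaches the threshold (Def. 3)."""
    return sum(1 for q in samples
               if q != p and snn_similarity(nn, p, q) >= sim_threshold)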
samples. In this step, we use a specified threshold MinPts to find all samples whose SNN density is greater than MinPts and mark them as core samples. (3) Generate clusters from core samples: if the similarity of two core samples is greater than or equal to Eps, they are combined into one cluster. (4) Then, samples whose similarity with a core sample is greater than or equal to Eps are assigned to the cluster containing the nearest core sample, and samples whose similarities with all core samples are less than Eps are marked as noise.
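A compact sketch of these steps is given below; it assumes precomputed SNN densities and similarities, and it is only one possible realization of the procedure described above, not the authors' implementation.

def snn_cluster(samples, density, similarity, eps, min_pts):
    """Pick core samples, merge cores whose SNN similarity reaches Eps,
    then attach remaining samples to the most similar core or mark noise."""
    cores = [p for p in samples if density[p] > min_pts]

    # Merge core samples into clusters (naive label propagation).
    label = {p: i for i, p in enumerate(cores)}
    for i, p in enumerate(cores):
        for q in cores[i + 1:]:
            if similarity[p][q] >= eps:
                old, new = label[q], label[p]
                for r in cores:
                    if label[r] == old:
                        label[r] = new

    clusters, noise = {p: label[p] for p in cores}, []
    for p in samples:
        if p in clusters:
            continue
        best = max(cores, key=lambda c: similarity[p][c], default=None)
        if best is not None and similarity[p][best] >= eps:
            clusters[p] = label[best]
        else:
            noise.append(p)
    return clusters, noise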
the order of the system calls stored in the database determines each system call's unique ID. Because the database operations are simple and efficiency is emphasized, we adopt SQLite. The 8793 malware samples call a total of 152 system calls, so every malware sample produces a feature vector of length 152. Each system call's ID is its position in the feature vectors. We then count how many times each malware sample invokes each system call and generate the feature vectors. For example, RegOpenKeyExA has ID 18; if sample p calls RegOpenKeyExA 100 times, the value of p's feature vector at position 18 is 100.
4.2 SNN
First of all, according to the fixed-length feature vectors of the malware, we use the Euclidean distance formula to calculate the distances between the malware samples. Because of the limited performance of our laboratory equipment, we have to save the calculated distances in text files. We create a file for each malware sample to keep its distances to all the other samples, so each distance is computed twice and stored in the two samples' corresponding files. To reduce the time spent writing files, we only store the ten nearest distances for each sample.
From the calculated distances between all malware samples, we obtain each sample's kNN. This paper obtains the kNN of the malware for k = 10 and k = 20. Similarly, the file names and the corresponding kNN distances are stored in files. Then, the SNN similarity between any two malware samples is calculated according to each sample's kNN file. On the basis of Algorithm 1, if two malware samples are in each other's kNN lists, the number of their shared neighbors is counted. Finally, a file is created for each malware sample to store its SNN similarities with all samples.
Once the SNN similarities between all malware samples are calculated, we can determine each sample's SNN density according to different values of Eps. In order to better display the effects of different Eps, in the case of k = 10 we take Eps = 1, 2, …, 10 and calculate the SNN density for each value. Similarly, for k = 20 we take Eps = 5, 6, …, 15. For the SNN density produced by each Eps, we set the MinPts value in the same way in order to select the core samples. Table 1 gives, under the condition k = 10, the number of core samples determined by different values of Eps and MinPts.
Table 1. The core samples size with different Eps and MinPts in 10NN
Table 2. The number of clusters with different Eps and MinPts in 10NN
The clustering algorithm based on SNN density works well for sparse samples, and extracting system calls as feature vectors is efficient and keeps the algorithm simple. Using the frequencies of system calls not only reflects how different samples call the same system calls, but is also simple to calculate and can be obtained through simple statistical work. Because the number of system calls is limited, and the number of system calls frequently used by malware is even more limited, system call frequencies yield fixed-length vectors. Usually the length of the vectors will not be too long, which avoids the work of reducing the dimension of high-dimensional vectors and saves time. For incremental samples, it is unnecessary to re-cluster every time: we can quickly determine their families by looking for the families of the nearest core samples. To sum up, after improving the algorithm, it can be used for the clustering analysis of malware in real environments.
From the number of clusters generated by clustering, we find that the samples of a family usually come together into one or more clusters. By looking at the generated clusters, we also find that the algorithm does not perform ideally at the edges of the families. One obvious phenomenon is that a small number of clusters mix samples from two families. Possible reasons for this phenomenon are: (1) the system call frequencies are used alone, and we cannot rule out the possibility that different functions produce the same system call frequencies; (2) the clustering algorithm based on SNN density does not handle boundaries very well; two close families may be connected by some intermediate samples, so it is difficult to divide the boundary.
When computing the accuracy, we found that the family names of the samples provided by Antiy are not common ones. The families with relatively good clustering results account for less than half of all families, and the samples of these families are labeled by Kaspersky and Microsoft on VT as belonging to the same families. For the other families, whose clustering results are not very good, the labels given by Kaspersky and Microsoft on VT indicate that parts of the samples do not truly belong to those families. So, in order to further determine the accuracy of the clustering algorithm, further screening of the samples is needed.
An in-depth analysis of the experimental results to determine Eps and MinPts is our future work. We will also carry out similar work to compare with the work in this paper.
6 Conclusion
Malware clustering analysis is one of the important parts of current automatic malware analysis. Fixed-length feature vectors formed from system call frequencies are easy to obtain, which also allows the algorithm to be better applied in practice. In this paper, we calculate the Euclidean distance to determine the actual distances between samples, compute the kNN according to these distances, determine the SNN similarity between any two samples from their kNN sets, and finally, by choosing appropriate parameters, produce the core samples, boundary samples, and noise samples. The algorithm improves the accuracy of malware clustering and lays a good foundation for further automatic analysis of malware.
References
1. Wang, H.-T., Mao, C.-H., Wei, T.-E., Lee, H.-M.: Clustering of similar malware behavior
via structural host-sequence comparison. In: IEEE 37th Annual Computer Software and
Applications Conference (2013)
2. Hu, X., Bhatkar, S., Griffin, K., Kang, G.: MutantX-S: scalable malware clustering based on
static features. In: Proceedings of the 2013 USENIX Conference on Annual Technical
Conference (2013)
3. Kostakis, O.: Classy: fast clustering streams of call-graphs. Data Min. Knowl. Dis. 28,
1554–1585 (2014)
4. Biggio, B., Rieck, K., Ariu, D., Wressnegger, C., Corona, I., Giacinto, G., Rol, F.: Poisoning
behavioral malware clustering. In: Proceedings of the 2014 Workshop on Artificial
Intelligent and Security Workshop (2014)
5. Ye, Y., Li, T., Chen, Y., Jiang, Q.: Automatic malware categorization using cluster ensemble. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 95–104 (2010)
6. Perdisci, R., ManChon, U.: VAMO: towards a fully automated malware clustering validity
analysis. In: Proceedings of the 28th Annual Computer Security Applications Conference
(2012)
7. Bayer, U., Comparetti, P.M., Hlauscheck, C., et al.: Scalable, behavior-based malware
clustering. In: 16th Symposium on Network and Distributed System Security (NDSS) (2009)
8. Iwamoto, K., Wasaki, K.: Malware classification based on extracted API sequences using static analysis. In: Proceedings of the Asian Internet Engineering Conference (2012)
9. Yan, G., Brown, N., Kong, D.: Exploring discriminatory features for automated malware
classification. In: Rieck, K., Stewin, P., Seifert, J.-P. (eds.) DIMVA 2013. LNCS, vol. 7967,
pp. 41–61. Springer, Heidelberg (2013)
10. Cesare, S., Xiang, Y., Zhou, W.: Malwise: an effective and efficient classification system for
Packed and Polymorphic Malware. IEEE Trans. Comput. 62, 1193–1206 (2013)
11. Cesare, S., Xiang, Y., Zhou, W.: Control flow-based malware variant detection. IEEE Trans.
Dependable Secure Comput. 11, 304–317 (2014)
12. Hongbo, S., Tomoki, H., Katsunari, Y.: Structural classification and similarity measurement
of malware. IEEJ Trans. Electr. Electron. Eng. 9, 621–632 (2014)
13. Jang, J.-W., Woo, J., Yun, J., Kim, H.K.: Mal-netminer: malware classification based on
social network analysis of call graph. In: Proceedings of the Companion Publication of the
23rd International Conference on World Wide Web Companion (2014)
Analyzing Eventual Leader Election Protocols
for Dynamic Systems by Probabilistic
Model Checking
1 Introduction
STABLE is the set of processes that will never crash or leave after entering the system. We note that this set is similar to its counterpart in the static model, i.e., the set of correct processes. Recall that in the static model, if a process does not crash during a run, it is considered correct; otherwise, it is considered faulty. Moreover, in the static model, there is an integer n − f which denotes the number of correct processes and guarantees that these processes are not blocked forever. Similarly, in the dynamic model, a value α is proposed to define the number of stable processes and prevent processes from blocking forever. Therefore, we have the following correspondence between the static model and the dynamic model: n − f corresponds to α, and a correct process $p_i$ corresponds to $i \in STABLE$.
In light of the above considerations, the assumption regarding the query-response pattern can be formulated as $MP_{dsw}$:
In the dynamic system, there are a time t, a process $p_i$ with $i \in STABLE$, and a set Q of processes such that $\forall t' \geq t$ we have:
(1) $Q \subseteq up(t')$;
(2) $i \in \bigcap_{j \in Q} winning_j(t')$; and
(3) $\forall x \in up(t')$: $Q \cap winning_x(t') \neq \emptyset$.
Intuitively, after time t there exist a subset Q of the set up(t') of processes currently in the dynamic system and a stable process $p_i$ such that, for every process j in Q, $p_i$ always belongs to $winning_j(t')$, and simultaneously, for every process x existing in up(t'), the intersection of Q and $winning_x(t')$ is never empty.
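For concreteness, the three conditions can be checked for one time instant as in the small Python sketch below; the function name and the example sets are our own illustration, not part of the protocol or its PRISM model.

def mp_dsw_holds(up, winning, Q, i):
    """Check the MP_dsw conditions at a single time instant t'.

    up      : set of processes currently in the system, up(t')
    winning : dict mapping each process x to its set winning_x(t')
    Q       : candidate subset of processes
    i       : identity of the stable process
    """
    cond1 = Q <= up                              # Q is a subset of up(t')
    cond2 = all(i in winning[j] for j in Q)      # i is in every winning_j(t')
    cond3 = all(Q & winning[x] for x in up)      # Q meets every winning_x(t')
    return cond1 and cond2 and cond3

up = {1, 2, 3, 4}
winning = {1: {1, 2}, 2: {1, 2, 3}, 3: {1, 3}, 4: {2, 4}}
print(mp_dsw_holds(up, winning, Q={1, 2}, i=1))  # True for this snapshot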
The leader election protocol is based on the existing work [2]. In our paper, we borrow
the idea from the software engineering community and re-examine the protocol model
through probabilistic model checking. Particularly, we utilize PRISM model checker to
conduct the analysis which is lacking in [2]. Such an analysis allows us to investigate
the protocol from a different perspective. More importantly, we extend the model by
relaxing crucial assumptions previously made on the environment (i.e., the commu-
nication is reliable), and perform a considerably more detailed quantitative analysis.
Firstly, it sends a QUERY message and then keeps waiting until α RESPONSE messages have been received. Next, it updates RECFROMi by computing the union of the rec_fromi sets of all the processes that sent a RESPONSE. Afterwards, trusti is modified by set intersection. Moreover, it updates its rec_fromi set based on the ids of those processes that sent RESPONSE messages. If this value has changed, process pi broadcasts another message, named TRUST, to the system so that all processes in the system can adjust their leader accordingly, as shown in Algorithm 3.
Every other process operates in the same way as pi described above. After some time, a unique leader is eventually elected once all trust sets stay stable (unchanged) and the leader identities of all processes point to the same value.
In addition, each component of a process's data structure is introduced below, together with its initial value. We use the data structure of process pi as an example.
Among all the variables, rec_fromi is the set of processes from which pi has received RESPONSE messages; RECFROMi is the union of the rec_from sets; trusti is the set of candidate leaders; log_datei is a logical time defining the age of trusti; and leaderi is the leader of pi.
All variables are initialized as follows.
For the latter, the local variables of every process are classified into three categories. In the first category (shown in Listing 1.1), each variable ranges from 0 to N and is initialized to N. It should be noted that N here is equal to $2^n - 1$, where n is the number of processes in the system; in our example, its value is set to 15 ($2^4 - 1$). The reason will be made clear later. In the second category (shown in Listing 1.2), each variable is Boolean and indicates whether or not the required messages have been received. In the third category (shown in Listing 1.3), every variable is a counter which records the number of execution steps or other similar information. Therefore, processi in the system has the data structure below.
In the first category (shown in Listing 1.1), rec_from_i is a decimal number which in fact is interpreted as a binary string. This string represents the set of processes that have sent RESPONSE messages. Similarly, RECFROM_i is the union of the rec_from sets of the processes in rec_from_i, trust_i is the candidate leader set, and new_trust_i is a temporary trust_i. Moreover, a_i, b_i, and c_i are temporary variables which store the operands participating in the AND and OR operations.
In the second category (shown in Listing 1.2), query_i denotes whether or not processi has sent a QUERY message, and response_ij denotes whether processi has sent a RESPONSE message to processj.
In the third category (shown in Listing 1.3), sign_ij is used to record the steps, despite the fact that it has only two choices. s_i plays the same role as sign_ij and records the execution steps of every process. log_date_i is a log variable used to note the logical time of trust_i, and leader_i is the leader of processi.
Key Procedures. According to the given algorithm, we design PRISM modules for every process. The algorithm is shown in Sect. 3.1. It should be noted that we assume α = 3 here for illustration.
We note that since there are no APIs or methods supporting set operations in PRISM, we translate the two sets to binary strings and then use these strings to complete the AND and OR operations, which are evidently equivalent to the intersection and union operations on sets, respectively.
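The sketch below (written in Python purely for illustration, not in the PRISM language) shows why bitwise AND/OR over such integer encodings coincide with set intersection/union; it also matches the rec_from_i example that follows, where process i contributes the bit $2^{i-1}$.

def to_mask(process_ids):
    """Encode a set of process ids {1..n} as an integer bitmask
    (process i contributes 2**(i-1), so {1,3,4} -> 1+4+8 = 13)."""
    return sum(1 << (i - 1) for i in process_ids)

def to_set(mask, n):
    """Decode a bitmask back into the set of process ids."""
    return {i for i in range(1, n + 1) if mask & (1 << (i - 1))}

a, b = to_mask({1, 3, 4}), to_mask({3, 4})
print(a, b)              # 13 12
print(to_set(a & b, 4))  # bitwise AND == set intersection: {3, 4}
print(to_set(a | b, 4))  # bitwise OR  == set union: {1, 3, 4}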
In brief, Listing 1.4 demonstrates the procedure of calculating rec_from_i. Once the guard is satisfied, rec_from_i changes its value according to the given formula. The guard consists of three conditions: (1) whether or not the leader has been elected (denoted by signal = 0); (2) whether or not reply_i equals three (denoted by reply_i = 3); (3) whether processi stays in state 1 (denoted by s_i = 1); only then does rec_from_i update its value. For example, if processi has received RESPONSE messages from process1, process3, and process4, then rec_from_i changes its value to 13, since response_1i is 1, response_3i is 4, response_4i is 8, and consequently their sum is 13. Simultaneously, a_i is set to rec_from_1, b_i is set to rec_from_3, and c_i is set to rec_from_4.
The algorithm regarding TRUST has three branches (shown in Listing 1.6), and each time one of them must be executed. Considering this trait, we use a synchronization sign to tackle this issue; the above construction successfully fulfills this purpose.
The unique leader of the system is elected once all leader_i variables point to the same object (shown in Listing 1.7). And after the unique leader has been elected, the system always keeps the "elected" status. Therefore, we design the solution above.
Usually, users are interested in whether or not the protocol can elect a unique leader in the dynamic system as fast as possible, so efficiency is a crucial concern. Considering the characteristics of the system and analyzing several possible factors, we assume that the time to elect a unique leader is related to the number of processes in the system and to the reliability of the channels through which the processes communicate with each other via query-response primitive messages.
the average number of execution steps they spent on electing a leader. We set the probability of sending RESPONSE messages to 1.0 because this time we only concentrate on the influence of the size of the system rather than other factors. Based on the above setup, we record the results of each system as follows:
In Fig. 1, apparently, as the system size increases, the average number of execution steps that the system spends on electing a leader grows rapidly. In the first experiment, with a system consisting of three processes, the number of steps is around 88 on average. After adding a new process to the system, the average number grows to 452. When we continue to add another process, the result reaches 2598, approximately five times the former value. In Fig. 2, we use a box plot to illustrate that, with the increasing size of the system, the variance of the number of execution steps grows rapidly as well. Therefore we conclude that the size of the system must be controlled well; otherwise, leader election in the dynamic system will be too costly to be affordable.
of electing a leader is influenced by the probability of sending messages. This fact is revealed by the number of execution steps, and the result is illustrated by Figs. 3 and 4.
In Fig. 3, it is obvious that the average number of execution steps decreases smoothly as the probability of successfully sending RESPONSE messages increases. When the probability is 0.5, the average cost equals 1551 steps, and once the probability increases to 0.9, the cost diminishes to 914 steps.
In Fig. 4, we can clearly observe that the variance of the execution steps decreases as the probability of successfully sending messages rises, and the number of outliers also decreases. From the observations demonstrated above, we conclude that the ratio of successfully sent RESPONSE messages plays a vital role in the efficiency of leader election in the dynamic system, and thus we should guarantee the stability of the communication channels in the system in order to improve the efficiency.
5 Related Work
Most existing leader election algorithms are based on static systems [14–16]; in contrast, their counterparts based on dynamic systems have attracted less attention. However, various applications are now built on dynamic systems, so the dynamic system setting can no longer be neglected. The importance of leader election, which is one of the fundamental building blocks of distributed computing, should therefore be highlighted in the dynamic system.
Mostefaoui et al. [2] adapted an existing eventual leader election protocol designed for the static model and translated it to a similar model suited to dynamic systems, by comparing the traits of the two models and adapting some specific features. In that paper, it was also theoretically proved that the resulting protocol was correct within the proposed dynamic model.
In [3], we also proposed a hierarchy-based eventual leader election model for dynamic systems. The proposed model was divided into two layers: the lower layer elects the cluster-heads of every cluster, while the upper layer elects the global leader of the whole system from these cluster-heads. The idea of the model is to distinguish the important processes from all processes and then pay more attention to those selected ones, in order to reduce the problem size and improve efficiency.
In addition, Larrea et al. [17] have pointed out the details of and the keys to electing an eventual leader. In other words, to achieve the goal of electing an eventual leader, some significant conditions must be satisfied, such as stability and a synchrony condition: a leader should be elected under the circumstance that no more processes join or leave the system. The proposed algorithm relies on comparing entering timestamps.
Meanwhile, there is another line of work applying formal methods to protocol verification, for example [18, 19]. In [18], Havelund et al. employed the real-time verification tool UPPAAL [20] to perform a formal and automatic verification of a real-world protocol, in order to demonstrate how model checking can influence practical software development. Although it is unrelated to eventual leader election, it demonstrates the feasibility of applying such techniques to real-world protocols. In [19], Yue et al. used PRISM to present an analysis of a randomized leader election in which some quantitative properties were checked and verified by PRISM. However, that work did not cover the issues for dynamic systems, which is the main focus of the current paper.
In this paper, we have investigated and analyzed properties of eventual leader election protocols for dynamic systems from a formal perspective. In particular, we employ PRISM to model an existing protocol, and we illustrate the average election rounds and their scalability via simulation. Moreover, we relax the assumptions made by the original protocol and use probability to model the reliability of the message channel. We also illustrate the relationship between the channel reliability and the efficiency, in election rounds, of the revised protocol based on probabilistic model checking. In the future, we plan to extend our model and cover more performance measures, such as energy consumption, to give a more comprehensive analysis framework.
Acknowledgements. The work was partially funded by the NSF of China under Grants No. 61202002 and No. 61379157, and by the Collaborative Innovation Center of Novel Software Technology and Industrialization.
References
1. Yang, Z.W., Wu, W.G., Chen, Y.S., Zhang, J.: Efficient information dissemination in dynamic
networks. In: 2013 42nd International Conference on Parallel Processing, pp. 603–610 (2013)
2. Mostefaoui, A., Raynal, M., Travers, C., Patterson, S., Agrawal, D., Abbadi, A.E.: From
static distributed systems to dynamic systems. In: Proceedings of the 24th Symposium on
Reliable Distributed Systems (SRDS05), IEEE Computer, pp. 109–118 (2005)
3. Li, H., Wu, W., Zhou, Yu.: Hierarchical eventual leader election for dynamic systems. In:
Sun, X.-h., Qu, W., et al. (eds.) ICA3PP 2014, Part I. LNCS, vol. 8630, pp. 338–351.
Springer, Heidelberg (2014)
4. Chen, S., Billings, S.A.: Neural networks for nonlinear dynamic system modelling and
identification. Int. J. Control 56(2), 319–346 (1992)
5. Merritt, M., Taubenfeld, G.: Computing with infinitely many processes. In: Herlihy, M.P.
(ed.) DISC 2000. LNCS, vol. 1914, pp. 164–178. Springer, Heidelberg (2000)
6. Guerraoui, R., Hurfin, M., Mostéfaoui, A., Oliveira, R., Raynal, M., Schiper, A.: Consensus
in asynchronous distributed systems: a concise guided tour. In: Krakowiak, S., Shrivastava,
S.K. (eds.) BROADCAST 1999. LNCS, vol. 1752, pp. 33–47. Springer, Heidelberg (2000)
7. Jha, S.K., Clarke, E.M., Langmead, C.J., Legay, A., Platzer, A., Zuliani, P.: A bayesian
approach to model checking biological systems. In: Degano, P., Gorrieri, R. (eds.) CMSB
2009. LNCS, vol. 5688, pp. 218–234. Springer, Heidelberg (2009)
8. Bradley, A.R.: SAT-based model checking without unrolling. In: Jhala, R., Schmidt, D.
(eds.) VMCAI 2011. LNCS, vol. 6538, pp. 70–87. Springer, Heidelberg (2011)
9. Kwiatkowska, M., Norman, G., Parker, D.: PRISM: probabilistic symbolic model checker.
In: Field, T., Harrison, P.G., Bradley, J., Harder, U. (eds.) TOOLS 2002. LNCS, vol. 2324,
pp. 200–204. Springer, Heidelberg (2002)
10. Kwiatkowska, M., Norman, G., Parker, D.: PRISM 4.0: verification of probabilistic
real-time systems. In: Gopalakrishnan, G., Qadeer, S. (eds.) CAV 2011. LNCS, vol. 6806,
pp. 585–591. Springer, Heidelberg (2011)
11. Zhou, R.F., Hwang, K.: Powertrust: A robust and scalable reputation system for trusted
peer-to-peer computing. IEEE Trans. Parallel Distrib. Syst. 18(4), 460–473 (2007)
12. Vaze, R., Heath, R.W.: Transmission capacity of ad-hoc networks with multiple antennas
using transmit stream adaptation and interference cancellation. IEEE Trans. Inf. Theory
58(2), 780–792 (2012)
13. Aguilera, M.K.: A pleasant stroll through the land of infinitely many creatures. ACM SIGACT News 2, 36–59 (2004)
14. Gupta, I., van Renesse, R., Birman, K.P.: A probabilistically correct leader election protocol
for large groups. In: Herlihy, M.P. (ed.) DISC 2000. LNCS, vol. 1914, pp. 89–103. Springer,
Heidelberg (2000)
15. Mostefaoui, A., Raynal, M., Travers, C.: Crash-resilient time-free eventual leadership. In:
Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems,
2004, pp.208–217. IEEE (2004)
16. Bordim, J.L., Ito, Y., Nakano, K.: Randomized leader election protocols in noisy radio
networks with a single transceiver. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P.,
Dongarra, J., Tang, F. (eds.) ISPA 2006. LNCS, vol. 4330, pp. 246–256. Springer,
Heidelberg (2006)
17. Larrea, M., Raynal, M.: Specifying and implementing an eventual leader service for
dynamic systems. In: 2011 14th International Conference on Network-Based Information
Systems (NBiS), pp. 243–249 (2011)
18. Havelund, K., Skou, A., Larsen, K.G., Lund, K.: Formal modeling and analysis of an audio/video protocol: an industrial case study using UPPAAL. In: IEEE 18th Real-Time Systems Symposium (1997)
19. Yue, H., Katoen, J.-P.: Leader election in anonymous radio networks: model checking
energy consumption. In: Al-Begain, K., Fiems, D., Knottenbelt, W.J. (eds.) ASMTA 2010.
LNCS, vol. 6148, pp. 247–261. Springer, Heidelberg (2010)
20. Behrmann, G., David, A., Larsen, K.G.: A tutorial on UPPAAL. In: Bernardo, M., Corradini,
F. (eds.) SFM-RT 2004. LNCS, vol. 3185, pp. 200–236. Springer, Heidelberg (2004)
A Dynamic Resource Allocation Model
for Guaranteeing Quality of Service
in Software Defined Networking Based
Cloud Computing Environment
1 Introduction
security, etc. Virtualization technology is the core of cloud computing, allowing users to access network resources by renting virtual machines (VMs) [2].
A large volume of network data resides on the cloud because the number of Internet users has increased tremendously over the past few years. Some applications, such as multimedia, video conferencing, and voice over IP (VoIP), need high bandwidth and computing power, for example high throughput for real-time multimedia services, low latency (delay) for VoIP, or low jitter for online gaming. Cloud providers encounter several challenges when they provide services for many cloud users simultaneously [3]. One of the most important is satisfying the QoS requirement of each multimedia or application user. In IP networks, a variety of techniques have been proposed to achieve better quality of service, including overprovisioning, buffering, traffic shaping, bucket algorithms, and resource reservation. Over the past decade, the Internet Engineering Task Force (IETF) has explored several QoS (quality of service) architectures, but none has been truly successful and globally deployed.
Software Defined Networking (SDN) is a novel network paradigm that decouples the forwarding plane from the control plane [4]. The most significant feature of SDN is that it provides centralized control and a global view of the network, so it can realize fine-grained control of network flows and allocate resources reasonably. It turns the network into a programmable component of the larger cloud infrastructure and provides virtualization of the underlying network [5]. However, SDN has not yet been used extensively by cloud computing providers. Therefore, we present a QoS management framework for an SDN-based cloud computing environment that takes full advantage of the centralized control and programmability of SDN. The network controller serves as the service manager in the cloud computing architecture and offers resource reservation and on-demand bandwidth. OpenFlow is the communication interface between the controller and the forwarding layer of the SDN architecture. The forwarding layer can be composed of abundant virtual machines (VMs) acting as cloud providers. Our work focuses on the SDN controller to implement bandwidth allocation and resource management according to the requirements of cloud users.
We classify cloud user service traffic into QoS flows and best-effort flows, both of which initially follow the shortest path. A QoS flow needs more network resources, and its QoS parameters, such as bandwidth, packet loss, jitter, delay, and throughput, must be guaranteed. The controller dynamically runs routing algorithms to calculate an available route when the current link load exceeds the load threshold. Depending on user requirements or application features, we differentiate traffic flows into different priority levels and combine this with queuing techniques to guarantee the transmission of high-priority flows. Numerical results are provided and analyzed to show that our framework can guarantee QoS and satisfy the quality of experience (QoE) of cloud users.
The rest of this paper is organized as follows. Section 2 reviews the main work on QoS provisioning in cloud computing environments. Section 3 discusses the proposed QoS framework and the optimization scheme for route calculation. Section 4 presents the experimental validation of bandwidth allocation and resource reservation for cloud users. Finally, Sect. 5 concludes the paper.
2 Related Work
Internet users share large amounts of data, such as multimedia and real-time data, on the cloud. Therefore, QoS is one of the essential characteristics to be achieved in the field of cloud computing and SDN. Service Level Agreements (SLAs) are established and managed between the cloud service provider and the cloud user to negotiate resource allocation and service requirements [6]. However, SLAs are mostly plain written documents drafted by lawyers, which are static and published on the providers' web sites, so it is difficult to achieve service satisfaction between the network service provider and the user dynamically.
Reference [6] presents the automated negotiation and creation of SLAs for network services by combining the WS-Agreement standard with the OpenFlow standard, delivering QoS guarantees for VMs in the cloud. Their approach allows customers to query and book available network routes between their hosts, including a guaranteed bandwidth. Reference [7] proposes a unified optimal provisioning algorithm that places VMs and network bandwidth to minimize users' costs; it accounts for uncertain demand by formulating and solving a two-stage stochastic optimization problem to optimally reserve VMs and bandwidth. Akella [8] proposes a QoS-guaranteed approach for bandwidth allocation by introducing queuing techniques for users of different priorities in an SDN-based cloud computing environment, where the flow controller selects the new path using a greedy algorithm. However, their approach is implemented in OpenVswitch [9] and does not employ an SDN controller, so their scheme lacks centralized control and programmability; moreover, considerable work in OpenVswitch is required to realize their design.
Many innovative approaches to QoS control have been studied in SDN. Reference [10] describes an adaptive measurement framework that dynamically adjusts the resources devoted to each measurement task while ensuring a user-specified level of accuracy. To use TCAM resources efficiently, [11] proposes a rule multiplexing scheme that studies the rule placement problem with the objective of minimizing rule space occupation for multiple unicast sessions under QoS constraints. Reference [12] proposes CheetahFlow, a novel scheme that predicts frequent communication pairs via support vector machines, detects elephant flows, and reroutes them onto non-congested paths to avoid congestion.
Some routing algorithms have been used to realize dynamic management of network resources. Reference [13] describes two algorithms, the randomized discretization algorithm (RDA) and the path discretization algorithm (PDA), for solving the ε-approximation of delay-constrained least-cost routing (DCLC); their simulations show that RDA and PDA run much faster than other routing algorithms on average. Reference [14] describes distributed QoS architectures for multimedia streaming over SDN and considers an optimization model to meet inter-domain multimedia streaming requirements. However, they only employ dynamic routing to minimize the adverse effects of QoS provisioning on best-effort flows. In a complex network, there may not be enough available paths onto which QoS flows can be rerouted, which leads to poor transmission performance for QoS flows. In this research, we select an available path by a routing algorithm when the current link cannot meet the users' requirements. In particular, we combine this with a queuing technique to satisfy the requirements of high-priority flows when the routing algorithm finds no candidate path.
The proposed QoS architecture is designed in the SDN controller, which provides reservation and on-demand bandwidth services. We send statistics requests to forwarding devices via the OpenFlow protocol to acquire state information such as network load, delay, jitter, and available link bandwidth, which are indispensable elements for implementing QoS control [15]. To mitigate the communication overhead between the controller and switches, we create an address resolution protocol (ARP) proxy to learn MAC addresses. The OpenFlow standard, which enables remote programming of the forwarding plane, is the first SDN standard and a vital element of an open software-defined network architecture. Our framework can therefore run multiple routing algorithms simultaneously in the SDN controller to allocate network resources dynamically. The design principle is that our architecture must be compatible with the current OpenFlow specification, requiring no extra modification of switch hardware, the OpenFlow protocol, or end hosts. This ensures that our solution is practical and requires little extra effort to deploy.
6. Rule Generator: This function packages the route and control information into flow entries and sends them to the forwarding devices.
In addition, we combine a queuing technique to guarantee high-priority flows when the routing optimization algorithm finds no candidate path, a situation that would otherwise lead to heavy congestion and packet loss. We configure multiple queues on the forwarders to support resource reservation when the topology starts up.
To implement QoS control in SDN, we ignore the communication overhead between the controller and the forwarding devices and the extra cost of the routing optimization algorithm. We assume that multiple paths exist in the topology so that the rerouting module can select a candidate path. Our framework resolves three challenging problems. First, we send the stats request message supported by the OpenFlow specification to the OpenFlow-enabled switches to acquire statistics, which are analyzed to determine resource scheduling. Second, to improve the efficiency of the controller, we delete the original flow entry and install the flow entry generated by the route calculation algorithm instead of modifying the original flow entry directly. Finally, we map flows of different levels, marked by the flow classifier module, into the corresponding queues configured via the OVSDB protocol. The workflow of our framework is described below, and a simplified code sketch follows the list:
• We configure different-level queues on each port of the OpenFlow switches using the OVSDB protocol. This is performed when the controller establishes an OpenFlow session with the switches.
• When a new flow arrives in the network, the controller chooses the shortest path using the Dijkstra algorithm provided by the routing calculation module.
• Flow and port statistics are collected from the switches periodically through the OpenFlow protocol. They are analyzed in the state manager module to estimate the network condition.
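The following sketch is our simplification of this workflow in Python with networkx, not the authors' RYU implementation; the topology, link capacity, load threshold, and queue names are placeholder assumptions:

```python
import networkx as nx

LINK_CAPACITY_MBPS = 100
LOAD_THRESHOLD = 0.8        # reroute when link utilization exceeds 80 %

def build_topology():
    g = nx.Graph()
    g.add_edge("s1", "s2", cost=1, load_mbps=0)   # cost = routing metric
    g.add_edge("s2", "s3", cost=1, load_mbps=0)   # load_mbps comes from collected stats
    g.add_edge("s1", "s3", cost=3, load_mbps=0)
    return g

def congested_edges(g):
    return [(u, v) for u, v, d in g.edges(data=True)
            if d["load_mbps"] / LINK_CAPACITY_MBPS > LOAD_THRESHOLD]

def path_congested(g, path):
    bad = {frozenset(e) for e in congested_edges(g)}
    return any(frozenset(e) in bad for e in zip(path, path[1:]))

def handle_new_flow(g, src, dst, is_qos, priority):
    """Install the Dijkstra shortest path; on congestion, try a candidate path
    that avoids loaded links, otherwise map the QoS flow to a priority queue."""
    path = nx.dijkstra_path(g, src, dst, weight="cost")
    if is_qos and path_congested(g, path):
        pruned = g.copy()
        pruned.remove_edges_from(congested_edges(g))
        try:
            return nx.dijkstra_path(pruned, src, dst, weight="cost"), "default-queue"
        except nx.NetworkXNoPath:
            return path, f"priority-queue-{priority}"   # fall back to queue mapping
    return path, "default-queue"

if __name__ == "__main__":
    g = build_topology()
    g["s1"]["s2"]["load_mbps"] = 90     # congestion reported by the statistics
    print(handle_new_flow(g, "s1", "s3", is_qos=True, priority=1))
```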
Here c_{i,j} and d_{i,j} are the cost and delay-variation coefficients for the arc (i,j), respectively. The CSP problem is to minimize the path cost function f_C(r) of a routing path r subject to a given constraint D_max, as below. The variable d_{i,j} indicates the delay variation for traffic on link (i,j), p_{i,j} is the packet loss measure, and β is the scale factor.
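The original equations did not survive extraction; based on the variables defined above, the constrained shortest path formulation they describe has roughly the following form (our reconstruction, not a verbatim copy of the original):

```latex
% Reconstruction under the stated assumptions; not reproduced verbatim from the paper.
\min_{r}\; f_C(r) = \sum_{(i,j)\in r} c_{i,j}
\qquad\text{s.t.}\qquad
f_D(r) = \sum_{(i,j)\in r} \bigl(d_{i,j} + \beta\, p_{i,j}\bigr) \le D_{max}
```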
We employ the LARAC (Lagrange Relaxation based Aggregated Cost) algorithm to solve the CSP problem, following Hilmi's scheme [16, 17]. First, it finds the shortest path r_C with the Dijkstra algorithm according to the path cost c. If r_C satisfies the constraint D_max, it is a feasible path; otherwise, the algorithm calculates the shortest path r_D constrained by delay variation. If this path does not satisfy the delay-variation constraint D_max, there is no feasible path in the current network and the algorithm stops; otherwise, it iterates until it finds a feasible path that satisfies the constraint.
It has been shown that LARAC is a polynomial-time algorithm that efficiently finds a good route in O([n + m log m]^2), where n and m are the numbers of nodes and links, respectively. LARAC also provides a lower bound on the theoretical optimal solution, which allows us to evaluate the quality of the result. Moreover, by further relaxing the optimality of paths, it provides an easy way to control the trade-off between the running time of the algorithm and the quality of the found paths.
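The iteration just described can be sketched as follows (a sketch using networkx Dijkstra on the aggregated cost c + λ·d; the edge attributes "cost" and "delay", the latter standing in for the delay-variation metric, are assumptions of this illustration):

```python
import networkx as nx

def _shortest(g, s, t, weight_of):
    # Dijkstra with a callable edge weight: weight_of(edge_data) -> float
    return nx.dijkstra_path(g, s, t, weight=lambda u, v, d: weight_of(d))

def _total(g, path, key):
    return sum(g[u][v][key] for u, v in zip(path, path[1:]))

def larac(g, s, t, d_max):
    p_c = _shortest(g, s, t, lambda d: d["cost"])          # cheapest path
    if _total(g, p_c, "delay") <= d_max:
        return p_c
    p_d = _shortest(g, s, t, lambda d: d["delay"])         # least-delay path
    if _total(g, p_d, "delay") > d_max:
        return None                                        # no feasible path
    while True:
        lam = (_total(g, p_c, "cost") - _total(g, p_d, "cost")) / \
              (_total(g, p_d, "delay") - _total(g, p_c, "delay"))
        r = _shortest(g, s, t, lambda d: d["cost"] + lam * d["delay"])
        agg = lambda p: _total(g, p, "cost") + lam * _total(g, p, "delay")
        if abs(agg(r) - agg(p_c)) < 1e-9:
            return p_d                                     # feasible result
        if _total(g, r, "delay") <= d_max:
            p_d = r
        else:
            p_c = r
```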
4 Experiment Validation
In this section, we introduce the prototype implementation and validation of the approach presented in Sect. 3. We run the emulation experiments in the Mininet 2.2.0 environment [18], which constructs the network topology. The hosts created by Mininet can be regarded as cloud providers and cloud users. The SDN controller is responsible for resource management and control through its connection to the Mininet environment. Our scheme is implemented in the RYU controller 3.15 [19], a component-based software-defined networking framework that provides software components with well-defined APIs, making it easy for developers to create new network management and control applications. The experiment topology is shown in Fig. 2.
The prototype network consists of eight OpenFlow-enabled switches. In this test scenario, we use OpenVswitch 2.3.1, a production-quality multi-layer virtual switch, as the forwarding device and OpenFlow 1.3 as the communication protocol between the controller and the forwarders. We define H1 as the cloud user, while H2, H3, and H4 are cloud providers that serve H1. We observe the performance of our QoS control approach by measuring bandwidth, packet loss, and delay variation.
To simulate actual cloud user application requirements, we use the iperf network test tool to generate traffic flows [20]. In this test scheme, we employ the three test flows shown in Table 1 as the cloud users' application requirements. QoS flow 1 needs more network resources, so it is given the highest priority in the network. The best-effort traffic is background traffic that changes the degree of congestion in the test network. To observe the direct effect of QoS control on flow performance, we use UDP and set a specified bandwidth for the two QoS flows and the best-effort traffic.
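The exact flow parameters of Table 1 are not reproduced here; the sketch below only illustrates how UDP flows with fixed offered bandwidths can be launched with iperf from the Mininet hosts (host addresses, rates, and timings are placeholder values, not the paper's settings):

```python
import subprocess
import time

FLOWS = [
    # (label, server IP, offered rate in Mbit/s, start delay in seconds)
    ("qos-flow-1", "10.0.0.2", 30, 0),
    ("qos-flow-2", "10.0.0.3", 20, 0),
    ("best-effort", "10.0.0.4", 40, 20),   # background flow joins at t = 20 s
]

def start_udp_flow(server_ip, rate_mbps, duration_s=60):
    # iperf client in UDP mode (-u) with target bandwidth (-b) and duration (-t);
    # each destination host must already be running `iperf -s -u`.
    cmd = ["iperf", "-c", server_ip, "-u", "-b", f"{rate_mbps}M",
           "-t", str(duration_s), "-i", "1"]
    return subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)

if __name__ == "__main__":
    t0 = time.time()
    for label, ip, rate, delay in FLOWS:
        time.sleep(max(0.0, delay - (time.time() - t0)))
        print(f"starting {label}")
        start_udp_flow(ip, rate)
```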
QoS flow 1, QoS flow 2, and the best-effort traffic have different priority levels and bandwidth requirements. To induce congestion, we set a specified bandwidth constraint on all links in the test network. First, H1 sends QoS flow 1 and QoS flow 2 to the corresponding destination hosts H2 and H3, while the background flow starts at 20 s. Initially, the controller calculates the shortest path for each new flow with the Dijkstra algorithm and installs flow entries into the OpenFlow switches. The total bandwidth of the three traffic flows exceeds the link threshold when the best-effort traffic enters the network at 20 s. The shortest path then becomes congested, and without QoS control the service requirements cannot be guaranteed. However, our QoS control framework reallocates network resources among the traffic flows to avoid the congestion problem on the shared path.
In Fig. 3, we compare the experimental results obtained with the proposed approach to those obtained without it. We activate our QoS control scheme at 40 s and observe the change in throughput. The throughput of the two QoS flows and the best-effort traffic increases and quickly reaches a steady state. The routing algorithm calculates an available route for QoS flow 1, so QoS flow 1 can be guaranteed by selecting another, lowest-load path. The experiment shows that the routing algorithm does not find an appropriate route for QoS flow 2 because of the cost and delay constraints; however, QoS flow 2 has a higher priority than the best-effort traffic, so it can be guaranteed by mapping it into a high-priority queue. In general, our scheme first ensures the highest-priority traffic and tries to mitigate the effect on best-effort traffic.
(Fig. 3. Throughput (Mbps) over time (s) for QoS Flow 1, QoS Flow 2, and best-effort traffic.)
The delay variation of the three test flows is shown in Fig. 4. QoS flow 1 and QoS flow 2 suffer from large fluctuations when the best-effort traffic enters the network at 20 s. The three test flows keep jittering between 20 s and 40 s because our QoS control scheme is not yet active. The delay variation of QoS flow 1 and QoS flow 2 decreases after we turn on QoS control at 40 s. Although QoS flow 2 still exhibits fluctuations influenced by the best-effort traffic, the level is acceptable for many applications and satisfies the quality of transmission. The background traffic is limited in order to provide available bandwidth for the QoS flows. We also evaluate the packet loss ratio under our proposed scheme in Fig. 5. Similarly, packet loss disappears once QoS flow 1 is rerouted onto another path and the best-effort traffic is mapped into a low-priority queue.
(Fig. 4. Delay variation (ms) over time (s) for QoS Flow 1, QoS Flow 2, and best-effort traffic.)
(Fig. 5. Packet loss ratio over time (s) for the two QoS flows and best-effort traffic.)
5 Conclusion
OpenVswitch. Experimental results have shown that our approach can provide end-to-end QoS guarantees for cloud users. In future work, we will validate our QoS framework with physical OpenFlow switches in a large-scale network.
Acknowledgements. We thank the reviewers for their valuable feedback. This work was supported by the Industry-University-Research Combination Innovation Foundation of Jiangsu Province (No. BY2013003-03) and the Industry-University-Research Combination Innovation Foundation of Jiangsu Province (No. BY2013095-2-10).
References
1. Mell, P., Grance, T.: The NIST definition of cloud computing (v15). Technical report,
National Institute of Standards and Technology (2011)
2. Wood, T., Ramakrishnan, K.K. Shenoy, P., Merwe, J.: Cloudnet: Dynamic pooling of cloud
resources by live wan migration of virtual machines. In: 7th ACM SIGPLAN/SIGOPS
international conference on virtual Execution Environments, pp. 121–132. ACM Press
(2011)
3. Hoefer, C.N. Karagiannis, G.: Taxonomy of Cloud Computing Services. In:
IEEEGLOBECOM Workshops, pp. 1345–1350 (2010)
4. McKeown, N., Anderson, T. Balakrishman, H., Parulkar, G., Peterson, L., Rexford, J.,
Shenker, S., Turner, J.: Openflow: enabling innovation in campus networks. In:
ACM SIGCOMM Computer Communication Review, pp. 69–74. ACM Press (2008)
5. Tootoonchian, A., Gorbunov, S. Ganjali, Y. Casado, M. Sherwood, R.: On controller performance
in software-defined networks. In: USENIX workshop on Hot Topics in Management of Internet,
Cloud, and Enterprise Networks and services (Hot-ICE), pp. 893–898. IEEE Press (2014)
6. Korner, M., Stanik, A. Kao, O.: Applying QoS in software defined networks by using
WS-Agreement. In: 6th IEEE International Conference on Cloud Computing Technology
and Science, pp. 893–898. IEEE Press (2014)
7. Chase, J., Kaewpuang, R. Wen, Y.G. Niyato, D.: Joint virtual machine and bandwidth
allocation in software defined network (SDN) and cloud computing environments. In: IEEE
International Conference on Communications (ICC), pp. 2969–2974. IEEE Press (2014)
8. Akella, A.V., Xiong, K.Q.: Quality of Service (QoS) guaranteed network resource allocation
via software defined networking (SDN). In: 12th IEEE International Conference on
Dependable, Autonomic and Secure Computing, pp. 7–13. IEEE Press (2014)
9. OpenvSwitch: A production quality, multilayer virtual switch. http://openvswitch.github.io/
10. Moshref, M., Yu, M.L., Govindan, R., Vahdat, A.: DREAM: dynamic resource allocation
for software-defined measurement. In: ACM SIGCOMM Computer Communication
Review, pp. 419–430. ACM Press (2014)
11. Huang, H., Guo, S., Li, P., Ye, B.L., Stojmenovic, I.: Joint optimization of rule placement
and traffic engineering for QoS provisioning in software defined network. In: IEEE
Transactions on Computers. IEEE Press (2014)
12. Su, Z., Wang, T., Xia, Y., Hamdi, M.: CheetahFlow: towards low latency software defined
network. In: IEEE International Conference on Communications, pp. 3076–3081. IEEE
Press (2014)
13. Chen, S., Song, M., Sahni, S.: Two techniques for fast computation of constrained shortest
paths. In: IEEE Transactions on Networking, pp. 1348–1352. IEEE Press (2008)
A Dynamic Resource Allocation Model for Guaranteeing Quality of Service 217
14. Egilmez, H.E., Tekalp, M.: Distributed QoS architectures for multimedia streaming over
software defined networks. In: IEEE Transactions on Multimedia, pp. 1597–1609 IEEE
Press (2014)
15. Kim, H., Feamster, N.: Improving network management with software defined networking.
In: IEEE Communications Magazine, pp. 114–119. IEEE Press (2013)
16. Juttner, A., Szviatovski, B., Mecs, I., Rajko, Z.: Lagrange relaxation based method for the
QoS routing problem. In: 20th Annual Joint Conference of the IEEE Computer and
Communications Societies, pp. 859–868. IEEE Press (2001)
17. Egilmez, H.E. Civanlar, S. Tekalp, A. M.: An Optimization Framework for QoS Enabled
Adaptive Video Streaming Over OpenFlow Networks. In IEEE Transactions on Multimedia,
pp. 710–715. IEEE Press (2013)
18. Mininet: Network Emulator. http://yuba.stanford.edu/foswiki/bin/view/OpenFlow/Mininet
19. RYU: Component-based Software Defined Networking Framework. http://osrg.github.io/ryu/index.html
20. Iperf: TCP/UDP Bandwidth Measurement Tool. https://iperf.fr/
Research and Development of Trust
Mechanism in Cloud Computing
1 Introduction
Cloud computing is a new computing model whose core is resource renting, application hosting, and outsourcing. Cloud computing has quickly become a hotspot of computer technology and has greatly enhanced the capacity for processing resources. However, the security challenges of cloud computing should not be overlooked. In 2014 alone, the information of 70 million users of a pan-European automated real-time gross settlement system was leaked, Home Depot's payment systems suffered cyber attacks that endangered the information of nearly 56 million credit card users, and Sony Pictures was attacked by hackers. Therefore, for companies to organize large-scale application of cloud computing technology and platforms, the security problems in cloud computing must be thoroughly analyzed and solved.
There is ubiquitous latent danger to data security and privacy because of cloud computing's dynamic nature, randomness, complexity, and openness. The main security issue in current cloud computing is how to implement a mechanism that distinguishes and isolates bad users so as to protect users from potential safety threats. Meanwhile, the services and the quality of service providers in the cloud computing environment are uneven, and a service provider cannot be relied upon to provide authentic, high-quality content
and services. Therefore, it is essential to confirm the quality of cloud services and cloud service providers.
Current research on these problems concentrates on trust and reputation mechanisms. The basic idea is to allow trading participants to evaluate each other after a transaction and, according to all the evaluation information about each participant, to calculate that participant's credibility, which serves as a reference for the other trading partners in the network when choosing trading partners in the future.
This paper introduces the latest research achievements around the key issues of trust mechanisms. Section 2 introduces the concepts of cloud computing. Section 3 analyzes in depth the relationship between trust mechanisms and cloud computing security. Section 4 selects the latest and most typical trust models and classifies and reviews them according to the mathematical methods they use. Section 5 reviews the application of trust mechanisms to the security problems of the different cloud computing layers. Section 6 analyzes current problems and points out new research opportunities.
2 Cloud Computing
At present, although there are many versions of the definition of cloud computing, the most widely accepted is that of the National Institute of Standards and Technology [1], which holds that cloud computing has five indispensable characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service, and that cloud services are divided into three levels: IaaS, PaaS, and SaaS. To achieve the localization of computing resources, companies such as Microsoft and IBM now provide a new service model of server container leasing, called Hardware as a Service (HaaS) [2].
HaaS, IaaS, PaaS, and SaaS differ in functional scope and focus. HaaS only meets tenants' needs for hardware resources, including storage space, computing power, network bandwidth, and so on, focusing on the performance and reliability of the hardware resources. IaaS provides pay-as-you-go, measurable resource pools in a heterogeneous resource environment, taking full use of hardware resources and users' requirements into account. PaaS is concerned not only with the integration of the underlying hardware resources but also with providing users with customizable application services by deploying one or more application software environments. SaaS not only takes full advantage of the underlying resources it requires, it must also provide users with customizable application services through the deployment of one or more application software environments. Paper [3] summarizes a cloud service delivery model according to the various ways the service models are realized, as shown in Fig. 1:
Cloud computing is essentially a methodological innovation in infrastructure design, built on a shared pool of IT resources composed of a large number of computing resources. The cloud computing model has significant advantages in information processing, storage, and sharing, making the dynamic creation of highly virtualized application services and data resources available to users.
(Fig. 1. Cloud service delivery model: SaaS (Software as a Service): Google Apps, SalesForce, Microsoft Live Office, NewSuite; PaaS (Platform as a Service): Microsoft Azure, Google App Engine, Force.com; IaaS (Infrastructure as a Service): Nirvanix SDN, Amazon AWS, CloudEx.)
In a complex network environment, threats to security may take various forms. Generally speaking, network security is intended to provide a protective mechanism that avoids vulnerability to malicious attacks and illegal operations. Essentially all security mechanisms adopt some trust mechanism to guard against security attacks. However, the development of any mechanism is accompanied by a game between users and the mechanism builder, and as trust mechanisms have become better understood, a variety of attacks on them have emerged.
The key issues in trust mechanisms are trust modeling and data management. The task of trust modeling is to design a scientific trust model that accurately describes and reflects the trust relationships in the system using appropriate metrics. Data management concerns the safe and efficient storage, access, and distribution of trust data in a distributed environment that lacks centralized control.
Security and privacy are the issues that most concern cloud computing users. With the emergence of more and more security risks, industry and academia have put forward corresponding security mechanisms and management methods, whose main purpose is to prevent cloud service providers from maliciously leaking or selling users' private information and from collecting and analyzing user data. Paper [4] summarizes, from a technical perspective, the security problems faced by the specific services of each cloud computing layer. Paper [5] proposes a framework comprising a cloud computing security service system and a cloud computing security evaluation system.
Trust modeling and the study of credibility management in cloud computing are still in their infancy. Current research on trust management covers the establishment and management of trust among service providers and between providers and users. Its main information security ideas can be summarized as three-dimensional defense and defense in depth, forming a whole life cycle of security management whose main features are warning, attack protection, response, and recovery.
4 Trust Model
For different application scenarios, scholars have used different mathematical methods and tools to build various models of trust relationships. This section introduces common trust models from the perspective of mathematical methods such as weighted averaging, probability theory, fuzzy logic, grey reasoning, machine learning, and statistical analysis, and analyzes their specific trust calculations.
Here C is the normalized local trust value matrix [c_{ij}], \vec{t}^{(k)} is the global trust value vector after k iterations, \vec{p} is the trust vector of the pre-trusted peers (p_i = 1/|P| if i ∈ P, otherwise p_i = 0), and P is the pre-trusted peer set.
The PowerTrust algorithm [7] improves on EigenTrust mainly in three aspects: (1) it determines the trusted peer collection more reasonably: by mathematical reasoning it proves the existence of a power-law relationship among peer evaluations, namely that there are a few power peers, which PowerTrust forms into the credible peer set; (2) it speeds up the convergence of the iteration process: PowerTrust puts forward the Look-ahead Random Walk (LRW) strategy, which greatly improves the rate of trust value aggregation; (3) it establishes a mechanism applicable to dynamic settings. Its disadvantages are: (1) it calculates the trust value without considering transaction volume, which allows malicious users to accumulate trust through small transactions and then easily cheat on large ones; (2) there is no penalty for malicious behavior.
PeerTrust [8] gives a local trust model in which the trust value of a peer is calculated only by the peers that have had dealings with it, without iteration over the entire network. The mathematical description of the model is shown in Eq. (2):

T(u) = \alpha \sum_{i=1}^{I(u)} S(u,i)\, Cr(p(u,i))\, TF(u,i) + \beta\, CF(u)   (2)

where p(u,i) denotes the peer that trades with peer u in the i-th transaction, S(u,i) is the satisfaction of that transaction, Cr(v) is the credibility of peer v, TF(u,i) is the trust factor produced by the transaction, \alpha and \beta are the weight parameters of the standardized trust values, and \alpha + \beta = 1.
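A direct transcription of Eq. (2) into code looks roughly as follows (a sketch that assumes the per-transaction satisfaction, partner credibility, and trust factors are already known; the weights are illustrative):

```python
def peertrust(transactions, community_factor, alpha=0.8, beta=0.2):
    """Eq. (2): T(u) = alpha * sum_i S(u,i)*Cr(p(u,i))*TF(u,i) + beta * CF(u).

    `transactions` is a list of (satisfaction, partner_credibility, trust_factor)
    tuples for peer u's I(u) transactions; `community_factor` is CF(u)."""
    feedback = sum(s * cr * tf for s, cr, tf in transactions)
    return alpha * feedback + beta * community_factor

# Example: three transactions of peer u
print(peertrust([(1.0, 0.9, 1.0), (0.8, 0.7, 0.5), (0.0, 0.4, 1.0)], community_factor=0.6))
```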
PeerTrust's advantages are: (1) the evaluation factors are normalized so that malicious peers cannot submit excessively high or low ratings; (2) it proposes a trust evaluation aggregation method, PSM, based on personal similarity to resist collusion attacks by malicious peers; (3) it establishes a trust calculation method using an adaptive time window to inhibit the dynamic oscillating behavior of peers.
The DyTrust model [9] is a dynamic trust model based on time frames that takes the impact of time on trust calculations into account; the authors introduce four trust parameters for computing the trustworthiness of peers, namely short-term trust, long-term trust, misusing-trust accumulation, and feedback credibility. Paper [10] refines the trust algorithm by introducing an experience factor, improving the scalability of the feedback reliability algorithm in the DyTrust model. Paper [11] further improves the DyTrust model and enhances the aggregation of feedback information by introducing a risk factor and a time factor.
that it specifies the prior probability distribution of the presumed parameters and then, according to the transaction results, uses Bayes' rule to infer the posterior probability of the parameters.
In Bayesian methods, the Dirichlet prior probability distribution assumes there are k kinds of results and that the prior probability of each result is uniform, i.e., the probability of each occurrence is 1/k. There are n transactions in total, and each transaction gives an evaluation, where the number of appearances of evaluation i (i = 1, 2, ..., k) is m_i (with \sum_i m_i = n). The posterior distribution of the parameter p to be estimated is:

f(p; m, k) = \frac{1}{\int_0^1 \prod_{i=1}^{k} x^{(m_i + C/k - 1)}\,dx} \prod_{i=1}^{k} p_i^{(m_i + C/k - 1)}   (5)

E(p_i) = \frac{m_i + C/k}{C + \sum_{i=1}^{k} m_i}   (6)
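Eq. (6) is straightforward to evaluate; the following sketch computes the expected probability of each evaluation level from observed counts (the default C = k, i.e., one pseudo-observation per level, is an illustrative prior weight, not a value prescribed above):

```python
def dirichlet_expectations(counts, prior_weight=None):
    """E(p_i) = (m_i + C/k) / (C + sum_j m_j), following Eq. (6).

    `counts` are the observed numbers m_i of each of the k evaluation levels;
    `prior_weight` is C (defaults to k, one pseudo-count per level)."""
    k = len(counts)
    c = prior_weight if prior_weight is not None else float(k)
    total = c + sum(counts)
    return [(m + c / k) / total for m in counts]

# Example: 5-level ratings with observed counts m = (8, 3, 1, 0, 0)
print(dirichlet_expectations([8, 3, 1, 0, 0]))
```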
Paper [13] proposes a trust algorithm based on the Dirichlet distribution. Using probabilistic expectations to express trust reflects the uncertainty of the trust value, and introducing a time decay factor into the calculation can partially suppress malicious transactions by users who first accumulate a certain trust value. However, it gives little consideration to the algorithm's resistance to malicious acts, or to recommendation trust and transaction volume.
(Figure: fuzzy trust evaluation pipeline: evaluation inputs are fuzzified, processed by a fuzzy inference engine driven by fuzzy rules, and defuzzified into a trustworthiness evaluation value.)
WTV = \frac{\sum_{s=1}^{S} \left[ e^{-(n-m)/D} \cdot \frac{t_{val} - t_{min}}{t_{max} - t_{min}} \cdot 5 \right]}{S}   (7)

Here D is the time decay function; trust values are fuzzified with triangular membership functions, processed by fuzzy reasoning, and, after defuzzification, a numerical trust value is obtained. The FTE algorithm enhances the ability to resist malicious behavior by adjusting its three input parameters WTV, OW, and AC. The downsides are that: (1) there is no calculation or assessment of OW; (2) it can be challenging to choose an efficient membership function; (3) there is no demonstration of the model's convergence.
FuzzyTrust [15] uses fuzzy logic inference rules to compute peers' global reputation. It has a high detection rate for malicious peers; however, the model does not consider the trust factors that affect the quality of evaluations, and the authors do not demonstrate the convergence of the model. FRTrust [16] uses fuzzy theory to calculate peer trust levels, reducing the complexity of the trust computation and improving trust ranking precision. Paper [17] puts forward the ETFT model by combining evidence theory and fuzzy logic, which improves the model's adaptability in dynamic environments and accelerates the aggregation of recommendation trust.
Khiabani et al. [21] propose the UTM trust model, which integrates history, recommendations, and other contextual information to calculate scores between individuals and can be used efficiently in low-interaction environments.
Cruz et al. [26] summarize the security problems in cloud computing, which include infrastructure security, data security, communication security, and access control. Trust management has become the bridge between interacting entities in cloud computing. This section describes the application of trust models to virtual machine security, user security, application reliability, and service quality.
Research on trust management systems has moved from centralized trust to distributed trust relationships, from static to dynamic trust models, from single-factor to multi-factor input models, and from evidence-theory models to a variety of mathematical models. The study of trust relationships is clearly a very active direction.
However, from this survey we can see that research on trust mechanisms still has the following problems in theory and realization: (1) current studies of trust mechanisms lack a risk mechanism and unified performance evaluation criteria for trust models; (2) in existing research, the performance of trust models is mostly evaluated by simulation experiments, and there is no evaluation of real-world performance.
This paper shows that in cloud computing and other new computing environments, various security requirements and application modes pose new challenges to trust mechanisms. With the emergence of new computing models and environments, such as cloud computing and the Internet of Things, it is increasingly urgent to refine the scientific problems of trust mechanisms under these new conditions and to carry out the corresponding research.
At the same time, new models suitable for describing dynamic trust relationships should continue to be explored, combining knowledge from other subjects such as machine learning and artificial intelligence.
Acknowledgments. This work is supported by the China Aviation Science Foundation (NO.
20101952021) and the Fundamental Research Funds for the Central Universities (NO.
NZ2013306).
References
1. Mell, P., Grance, T.: The NIST definition of cloud computing. Nat. Inst. Stand. Technol. 53,
50 (2009)
2. Chuang, L., Wen-Bo, S.: Cloud Computing Security: Architecture, Mechanism and
Modeling. Chin. J. Comput. 36, 1765–1784 (2013) (in Chinese)
3. Almorsy, M., Grundy, J., Müller, I.: An analysis of the cloud computing security problem.
In: Proceedings of APSEC 2010 Cloud Workshop, Sydney, Australia, 30 November 2010
4. Subashini, S., Kavitha, V.: A survey on security issues in service delivery models of cloud
computing. J. Netw. Comput. Appl. 34, 1–11 (2011)
5. Feng, D.G., Zhang, M., Zhang, Y., Zhen, X.U.: Study on cloud computing security.
J. Softw. 22, 71–83 (2011)
6. Kamvar, S.D., Schlosser, M.T., Garcia-Molina, H.: The Eigentrust algorithm for reputation
management in P2P networks. In: Proceedings of the 12th International World Wide Web
Conference, WWW 2003 (2003)
7. Kai, H., Zhou, R.: PowerTrust: a robust and scalable reputation system for trusted
peer-to-peer computing. IEEE Trans. Parallel Distrib. Syst. 18, 460–473 (2007)
8. Li, X.: Liu, L.: PeerTrust: supporting reputation-based trust for peer-to-peer electronic
communities. IEEE Trans. Knowl. Data Eng. 16, 843–857 (2004)
9. Jun-Sheng, C., Huai-Ming, W.: DyTrust: A time-frame based dynamic trust model for P2P
systems. Chin. J. Comput. 29, 1301–1307 (2006) (in Chinese)
10. Shao-Jie. W., Hong-Song, C.: An improved DyTrust trust model. J. Univ. Sci. Technol.
Beijing, 30, 685–689 (2008) (in Chinese)
11. Zhi-Guo, Z., Qiong, C., Min-Sheng, T.: Trust model based on improved DyTrust in P2P
network. Comput. Technol. Dev. 174–177 (2014) (in Chinese)
12. Despotovic, Z., Aberer, K.: Maximum likelihood estimation of peers; performance in P2P
networks. In: The Second Workshop on the Economics of Peer-to-Peer Systems (2004)
13. Haller, J., Josang, A.: Dirichlet reputation systems. In: 2012 Seventh International
Conference on Availability, Reliability and Security, 112–119 (2007)
14. Schmidt, S., Steele, R., Dillon, T.S., Chang, E.: Fuzzy trust evaluation and credibility
development in multi-agent systems. Appl. Soft Comput. 7, 492–505 (2007)
15. Song, S., Kai, H., Zhou, R., Kwok, Y.K.: Trusted P2P transactions with fuzzy reputation
aggregation. IEEE Internet Comput. 2005, 24–34 (2005)
Research and Development of Trust Mechanism in Cloud Computing 229
16. Javanmardi, S., Shojafar, M., Shariatmadari, S., Ahrabi, S.S.: FRTRUST: a fuzzy reputation
based model for trust management in semantic P2p grids. Int. J. Grid Util. Comput. 6, (2014)
17. Tian, C., Yang, B.: A D-S evidence theory based fuzzy trust model in file-sharing P2P
networks. Peer Peer Netw. Appl. 7, 332–345 (2014)
18. Kuter, U.: Using probabilistic confidence models for trust inference in web-based social
networks. ACM Trans. Int. Technol. Toit Homepage 10, 890–895 (2010)
19. Tang, J., Lou, T., Kleinberg, J.: Inferring social ties across heterogeneous networks. In:
WSDM 2012, 743–752 (2012)
20. Rettinger, A., Nickles, M., Tresp, V.: Statistical relational learning of trust. Mach. Learn. 82,
191–209 (2011)
21. Khiabani, H., Idris, N.B., Manan, J.L.A.: A Unified trust model for pervasive environments
– simulation and analysis. KSII Trans. Int. Inf. Syst. (TIIS) 7, 1569–1584 (2013)
22. Liu, H., Lim, E.P., Lauw, H.W., Le, M.T., Sun, A., Srivastava, J., Kim, Y.A.: Predicting
trusts among users of online communities: an epinions case study. In: Ec 2008 Proceedings
of ACM Conference on Electronic Commerce, pp. 310–319 (2008)
23. Zolfaghar, K., Aghaie, A.: A syntactical approach for interpersonal trust prediction in social
web applications: Combining contextual and structural data. Knowl. Based Syst. 26, 93–102
(2012)
24. Tang, J., Gao, H., Hu, X.: Exploiting homophily effect for trust prediction. In: Proceedings
of the Sixth ACM International Conference on Web Search and Data Mining, 53–62 (2013)
25. Wang, Y., Wang, X., Zuo, W.-L.: Trust prediction modeling based on social theories. J. Softw. 12, 2893–2904 (2014) (in Chinese)
26. Cruz, Z.B., Fernández-Alemán, J.L., Toval, A.: Security in cloud computing: a mapping
study. Comput. Sci. Inf. Syst. 12, 161–184 (2015)
27. Zheng-Ji, Z., Li-Fa, W., Zheng, H.: Trust based trustworthiness attestation model of virtual
machines for cloud computing. J. Southeast Univ. (Nat. Sci. Ed.) 45(1), 31–35 (2015)
(in Chinese)
28. Tan, W., Sun, Y., Li, L.X., Lu, G.Z., Wang, T.: A trust service-oriented scheduling model
for workflow applications in cloud computing. IEEE Syst. J. 8, 868–878 (2014)
29. Sidhu, J., Singh, S.: Peers feedback and compliance based trust computation for cloud
computing. In: Mauri, J.L., Thampi, S.M., Rawat, D.B., Jin, B. (eds.) Security in Computing
and Communications, vol. 467, pp. 68–80. Springer, Heidelberg (2014)
30. Xiao-Lan, X., Liang, L., Peng, Z.: Trust model based on double incentive and deception
detection for cloud computing. J. Electron. Inf. Technol. 34(4), 812–817 (2012) (in Chinese)
31. Jaiganesh, M., Aarthi, M., Kumar, A.V.A.: Fuzzy ART-based user behavior trust in cloud
computing. In: Suresh, L.P., Dash, S.S., Panigrahi, B.K. (eds.) Artificial Intelligence and
Evolutionary Algorithms in Engineering Systems, vol. 324, pp. 341–348. Springer, India
(2015)
32. Yan-Xia, L., Li-Qin, T., Shan-Shan, S.: Trust evaluation and control analysis of
FANP-based user behavior in cloud computing environment. Comput. Sci. 40, 132–135
(2013) (in Chinese)
33. Guo-Feng, S., Chang-Yong, L.: A security access control model based on user behavior trust
under cloud environment. Chin. J. Manag. Sci. 52, 669–676 (2013) (in Chinese)
Analysis of Advanced Cyber Attacks
with Quantified ESM
1 Introduction
identification of security weaknesses and analysis of the existing security policy. The objective of the model was to offer analysts an efficient tool for attack modeling. The ESM is also well suited as a tool for imparting knowledge in the field of advanced information security, particularly offensive security.
To give analysts a more credible basis for decision support, and to enable an even more effective transfer of knowledge and new insights into the educational process, we present in this paper a quantification of the ESM.
The paper is further organized as follows: Sect. 2 provides the background, introducing, inter alia, attack modeling, the ESM model, and the CVSS framework. Section 3 presents the quantification of the ESM model with an additional explanation of the rules for applying the attribute values and the purpose of the quantification. The validation and an example of the use of the ESM can be found in Sect. 4. Section 5 gives an overview of recent work on attack trees as a basic structural technique for attack modeling. Section 6 is a discussion, followed by a conclusion in the last section.
2 Background
This section briefly presents the basis for further understanding. First, we introduce the purpose of attack modeling, followed by a brief presentation of the ESM model and the CVSS standard.
For deliberate security decisions we need metrics that tell us the effect on the system of a successfully executed individual attack. In the analysis of attack models, the qualities of different attributes are taken into consideration. The set of attributes is broad and often includes various attributes for probabilities, risks, the effects of the attacks, and expenses, particularly in terms of security systems [11].
Our selection of attributes intentionally avoids the use of probability and of attributes with probability as one of their factors. The ESM model is primarily intended for modeling sophisticated cyber attacks on critical infrastructure, whose characteristics are a clear target and a desired objective of the attack. The set of attributes of the model is therefore the following:
• Cost: It expresses the financial value of an individual attack in the operation. Most often this is an accounting value and it is not generally identifiable, because the cost of the same attack technique often varies.
• Complexity: It displays the complexity of a particular attack technique. The score can be based on values obtained from scoring systems such as CVSS.
• Impact: It represents the numerically expressed impact on the system caused by an individual attack. As with complexity, the value is given analytically or with the help of scoring systems.
Table 1 presents the rules for applying these attributes. Attribute values are assigned only to end-nodes; the values of intermediate and root nodes are calculated according to the rules.
The individual values of the end-node attributes are set by analysts. The cost of the attack can be supported by a traditional financial calculation. The values for the complexity attribute can come from different frameworks for communicating the characteristics and impacts of vulnerabilities. Among the utilities for selecting the value, we can also count vulnerability detection tools, intrusion detection systems, and security update management tools, given an established laboratory environment or otherwise acquired capacities. The values for the impact attribute can be defined after a preliminary analytical preparation that maps numerical values to descriptive definitions. Such a mapping is shown in Table 2, summarized from [8]:
Table 2. The mapping of numerical value of the impact into a descriptive definition.
Numerical Range Impact Definition
1-3 Minor impact to system.
4-6 Moderate impact to system.
7-9 Severe impact to system.
10 System completely compromised, inoperable, or destroyed.
one of the nodes – the conditional subordination node (CSUB). In these rules, value_i denotes the value of the individual attribute obtained by the AND combination of the nodes that are descendants of the CSUB node, and value_P represents the value of the attribute in the actuator node attached to the CSUB node. It should also be noted that, unlike the CSUB node, the housing (second) node in the ESM is not classified as an intermediate node but functions as an end-node.
Table 3. Rules for the set of attributes in the ESM in the CSUB node

Conditional subordination node:
Cost:        if \sum_{i=1}^{n} cost_i < cost_P  then  \sum_{i=1}^{n} cost_i  else  cost_P
Complexity:  if \max_{i=1}^{n} complexity_i < complexity_P  then  \sum_{i=1}^{n} complexity_i  else  complexity_P
Impact:      if \frac{10^n - \prod_{i=1}^{n}(10 - impact_i)}{10^{n-1}} > impact_P  then  \frac{10^n - \prod_{i=1}^{n}(10 - impact_i)}{10^{n-1}}  else  impact_P
For the CSUB node, the objective is considered achieved when all subgoals in the offspring nodes reach their targets. Alternatively, an initiator node (P-1 in the figure) can be selected to reach the objective. This node can therefore be written as follows: G-0 = P-1 or (G-1 and G-2).
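To make the bottom-up calculation concrete, the following sketch evaluates the CSUB rules of Table 3 for a node with AND-combined descendants and an attached actuator node (the rules for plain AND/OR nodes from Table 1 are not reproduced here, and the example values are placeholders, not figures from the validation):

```python
from math import prod

def csub_cost(costs, cost_p):
    total = sum(costs)
    return total if total < cost_p else cost_p

def csub_complexity(complexities, complexity_p):
    return sum(complexities) if max(complexities) < complexity_p else complexity_p

def csub_impact(impacts, impact_p):
    # Combined impact of the AND-ed descendants on a 1-10 scale (Table 3)
    combined = (10 ** len(impacts) - prod(10 - x for x in impacts)) / 10 ** (len(impacts) - 1)
    return combined if combined > impact_p else impact_p

# Placeholder values for a CSUB node with two descendants and an actuator node
descendant_costs, actuator_cost = [1000, 1500], 500
descendant_impacts, actuator_impact = [6, 7], 5
print(csub_cost(descendant_costs, actuator_cost))        # -> 500 (actuator is cheaper)
print(csub_impact(descendant_impacts, actuator_impact))  # -> 8.8
```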
This section illustrates an example and is composed as follows. First comes a presentation of a computer-network operation with the ESM model, which is equipped with all five of its properties. In line with this model, four standard tables map codes (nodes, attack vectors, exploited vulnerabilities, and segments) to understandable descriptions. This is followed by a reading of the model. The detailed methodology for using the ESM model can be found in [4].
The section then presents the application of the attributes: a table displays the assignment of attribute values to the end-nodes of the model and the calculation of the values in the root and intermediate nodes of the tree structure.
Fig. 2. ESM model with main goal G-0 “eavesdropping on the internal communications”.
Table 5. Description of the attack vectors labelled in the enhanced structural model in Fig. 2.
Vector Description
v1 Optical Splitting Technique (e.g. FBG – Fiber Bragg Gratings)
v2 DLL Hijacking
Model reading: The main objective of the attack described by the model is represented by node G-0, "eavesdropping on the internal communications". The node is of OR type, which means we can choose between nodes G-1 and G-2, the so-called partial objectives, to carry out the attack.
Node G-1 is a conditional subordination node (CSUB). To achieve the objective of this node, i.e. "capture traffic data with the physical presence", we need to carry out the activity in node G-3 as well as the activity envisaged by node G-4. In implementing node G-4 there is an additional information item, represented by the attack vector (v1), which prescribes a specific offensive method. Owing to the properties of the CSUB node, the objective can also be achieved with an alternative implementation of the actuator node P-1. Here we also encounter a prescribed offensive method, represented by the parameter v2.
Node G-2 is a classic OR-type node. To achieve its objective we can choose among the activities in nodes G-5 and G-6 as well as in the housing node O-1. In line with the rules of OR logic expressions, both activities in nodes G-5 and G-6 can be carried out. However, if we choose the activity in node O-1, then the G-6 activity is certainly not carried out, because there is an XOR operation between those nodes. This part of the tree structure is annotated with codes of exploited vulnerabilities; when vulnerability assessment standards are used, the codes direct us to the proper selection of attribute values in the individual nodes.
Table 9. (Continued)
End-node  Attribute values*     Source
P-1       Cost: 500 €           Accounting value
          Complexity: MEDIUM    Expert assessment
          Impact: 5             Expert assessment
O-1       Cost: 2000 €          Estimation
          Complexity: HIGH      Expert assessment
          Impact: 8             Expert assessment
* Values are only an example of the application of the attribute values; they are set according to the selected target, its environment, and the complexity of the operation, which is why the values for the same operation or attack can vary.
The authors of [13] present a case study using an extended attack tree called the attack-defense tree. Their model consists of defense nodes and a set of different attributes. The case study consists of four phases: creation of the model, attribute decoration, preparation of the attribute values, and, at the end, calculation. Different attributes are used for the case study (cost, detectability, difficulty, impact, penalty, profit, probability, special skill, and time), each presented with a description and stock values.
In a feasibility study, the author of [14] addresses the problem of constructing and maintaining large attack trees and proposes an approach to automate the construction of such models. He states that the automatic construction of a model for a specific threat is feasible to some extent, although manual work is still required in the lower parts of the model. To support the automatic construction of the model, different inputs are necessary: architectural information, tools for making risk assessments, and a body of security knowledge.
The authors of [15] present the foundations of an extended attack tree model that increases expressiveness with an operator allowing the modeling of ordered events. Their model, named the SAND attack tree, thus includes a sequential conjunctive operator, SAND. In the paper, the model is defined semantically, with semantics based on series-parallel (SP) graphs. Through their work on attributes within the model, the authors also enable a quantitative analysis of attacks using a bottom-up algorithm.
The author of [16] states that present information systems are becoming more dynamic and comprehensive, which is why security models face a scalability problem. Furthermore, he notes that existing security models do not allow the capture and analysis of unforeseen scenarios resulting from unknown attacks. In the dissertation, the author therefore proposes methods appropriate for security modeling in large systems and develops methods for the efficient treatment of countermeasures. On this basis he presents, inter alia, a hierarchical security model and assessment methods, thereby reaching the pre-set goals.
The authors of [17] use the attack tree for the systematic presentation of possible attacks in their proposal of a new quantitative approach to security analysis of computer systems. The authors of [18] focus on defense against the malicious attacks that system administrators have to deal with. In their analysis, they use an attack tree to optimize the security function; the tree handles attributes that are relevant when facing attacks and configuring countermeasures. They thus provide better support for decision-making and for choosing among possible solutions.
The authors of [19] use the attack tree model as a basis for identifying malicious code on Android systems. Their expanded version of the model allows a new, flexible way of organizing and using rules, and a combination of static and dynamic analysis yields better accuracy and performance. The authors of [20] demonstrate that attacker profiling can become part of existing quantitative tools for security assessment. This provides a more precise assessment of changes in an item when its subordinate components change and allows greater analytical flexibility in terms of foresight, prioritization and prevention of attacks.
6 Discussion
Structural models for attack modeling, such as the attack tree, face various criticisms. Some note that in attack tree modeling it is difficult to reuse or partition certain attack trees [21]. Others point to insufficient accuracy in the presentation of the attack during model analysis [22]. That is why we introduced our ESM model with features that enable a more realistic view of attacks. The main objective of the first design of the ESM model was to contribute to a better understanding of attack implementation and to the identification of security weaknesses.
When reviewing related techniques in the field of attack trees, we found the quantification of security, particularly with bottom-up algorithms, to be a very active area. The quantification of the ESM model presented in this paper supports attack analysis through three mutually independent attributes. Such support is most useful when the model offers an alternative way of implementing the attack and a more tangible basis is needed for selecting among attacks. The model does not offer convenient options for finding the best way of carrying out the complete attack with respect to a selected attribute. Moreover, the individual values at the root can come from different branches. This means, for example, that the cost attribute in the main goal (G-0) shows the lowest price for carrying out the attack, but this price does not necessarily allow achieving the impact displayed by the impact attribute in the same node.
As mentioned above, the assignment of initial attribute values in the end-nodes is based on various sources. The accounting value as well as the impact of the attack serve merely as an illustration in the validation. Objective values can be obtained only when we have sufficient information about the target systems on which the ESM model is designed, as well as secondary information, which can vary across different targets of the same attack. An additional source of input for attribute values is the CVSS standard, but this scoring does not address the economic impact of an exploited vulnerability. Besides, the base metrics of the CVSS standard and the scoring itself can be interpreted differently.
7 Conclusion
In the future, we plan to test the Quantified ESM on a wider set of complex
examples. It would be reasonable to integrate a database with known vulnerabilities
and other associated attributes and create a catalogue with the implemented threats and
their analyses.
References
1. Pietre-Cambacedes, L., Bouissou, M.: Beyond attack trees: dynamic security modeling with
Boolean logic Driven Markov Processes (BDMP). In: European Dependable Computing
Conference, pp. 199–208 (2010)
2. Fovino, I.N., Masera, M., De Cian, A.: Integrating cyber attacks within fault trees. Reliab.
Eng. Syst. Saf. 9, 1394–1402 (2009)
3. Ivanc, B., Klobucar, T.: Attack modeling in the critical infrastructure. J. Electr. Eng.
Comput. Sci. 81(5), 285–292 (2014)
4. Ivanc, B., Klobučar, T.: Modelling of information attacks on critical infrastructure by using
an enhanced structural model, Jozef Stefan International Postgraduate School (2013)
5. Yan, J., He, M., Li, T.: A Petri-net model of network security testing. In: IEEE International
Conference on Computer Science and Automation Engineering, pp. 188–192 (2011)
6. Ten, C.W., Manimaran, G., Liu, C.C.: Cybersecurity for critical infrastructures: attack and
defense modeling. IEEE Trans. Syst. Man Cybern. Part A: Syst. Humans 4, 853–865 (2010)
7. Camtepe, A., Bulent, Y.: Modeling and detection of complex attacks. In: Third International
Conference on Security and Privacy in Communications Networks and the Workshops,
pp. 234–243 (2007)
8. Edge, K., Raines, R., Grimaila, M., Baldwin, R., Bennington, R., Reuter, C.: The use of
attack and protection trees to analyze security for an online banking system. In: Proceedings
of the 40th Hawaii International Conference on System Sciences, p. 144b (2007)
9. Bistarelli, S., Fioravanti, F., Peretti, P.: Defense trees for economic evaluation of security
investments. In: The First International Conference on Availability, Reliability and Security,
pp. 416–423 (2006)
10. Mell, P., Scarfone, K., Romanosky, S.: A Complete Guide to the Common Vulnerability
Scoring System Version 2.0. (2007)
11. Kordy, B., Pietre-Cambacedes, L., Schweitzer, P.: DAG-based attack and defense modeling:
Don’t miss the forest for the attack trees. Comput. Sci. Rev. 13–14, 1–38 (2014)
12. Khand, P.A.: System level security modeling using attack trees. In: 2nd International
Conference on Computer, Control and Communication, pp. 1–7 (2009)
13. Bagnato, A., Kordy, B., Meland, P.H., Schweitzer, P.: Attribute decoration of attack-defense
trees. Int. J. Secure Softw. Eng. 3(2), 1–35 (2012)
14. Paul, S.: Towards automating the construction & maintenance of attack trees: a feasibility
study. In: Proceedings of the 1st International Workshop on Graphical Models for Security,
pp. 31–46 (2014)
15. Jhawar, R., Kordy, B., Mauw, S., Radomirović, S., Trujillo-Rasua, R.: Attack trees with
sequential conjunction. In: Federrath, H., Gollmann, D. (eds.) SEC 2015. IFIP AICT, vol.
455, pp. 339–353. Springer, Heidelberg (2015)
16. Hong, J.B.: Scalable and Adaptable Security Modelling and Analysis, PhD Thesis.
University of Canterbury (2015)
17. Almasizadeh, J., Azgomi, M.A.: Mean privacy: a metric for security of computer systems.
Comput. Commun. 52, 47–59 (2014)
18. Dewri, R., Ray, I., Poolsappasit, N., Whitley, D.: Optimal security hardening on attack tree
models of networks: a cost-benefit analysis. Int. J. Inf. Secur. 11(3), 167–188 (2012)
19. Zhao, S., Li, X., Xu, G., Zhang, L., Feng, Z.: Attack tree based android malware detection
with hybrid analysis. In: IEEE 13th International Conference on Trust, Security and Privacy
in Computing and Communications, pp. 1–8 (2014)
20. Lenin, A., Willemson, J., Sari, D.P.: Attacker profiling in quantitative security assessment
based on attack trees. In: Bernsmed, K., Fischer-Hübner, S. (eds.) NordSec 2014. LNCS,
vol. 8788, pp. 199–212. Springer, Heidelberg (2014)
21. Dalton, G.C., Mills, R.F., Colombi, J.M., Raines, R.A.: Analyzing attack trees using
generalized stochastic Petri nets. In: Information Assurance Workshop, pp. 116–123 (2006)
22. Pudar, S., Manimaran, G., Liu, C.C.: PENET: a practical method and tool for integrated
modeling of security attacks and countermeasures. Comput. Secur. 28(8), 754–771 (2009)
A Web Security Testing Method
Based on Web Application Structure
1 Introduction
The purpose of structure analysis is to describe the Web application. In this section, we propose a new model named the Web relation graph (WRG) and introduce several types of paths in the WRG that can describe the security of Web applications.
Definition 2.2. If a data edge e (e ∈ ED) exists between two Web elements, then data transfer takes place between these two elements, and the collection of the data passing through e is denoted Datae.
A simple user login scenario described by the WRG is shown in Fig. 2. The user submits personal information to the server through n1. The verification process takes place in n2 and n7. If the validation succeeds, the user enters the personal center. In the WRG, the personal center n4 is constructed by the dynamic service page n3, so different users enter different personal centers. When a user logs out of the personal center n4, the service n5 clears the personal information from the session and the user returns to the start page n1. Otherwise, the user enters the error message page n6 and can be redirected from n6 back to n1.
server side, and provide data access and processing functions. The details are presented in Table 1.
We will discuss several special types of paths in WRG based on this constraint.
Definition 2.3. For a path p in the WRG, if all edges of p are data edges (ED), then p is called a data relation path.
Figure 3 depicts a sample of a data relation path; the data set carried by such a path is the intersection of the data sets of its edges, as formalized in Definition 2.4.
Definition 2.4. Let ni and nj be Web elements in the WRG, and let Di,j be the collection of the data transferred from ni to nj. If there exists a data relation path pk between ni and nj, then Di,j,k is the collection of the data transferred along pk. If Ek = {ek1, ek2, ..., ekn} denotes the data edges of pk, then Di,j,k = Dataek1 ∩ Dataek2 ∩ ... ∩ Dataekn, and Di,j = ∪k Di,j,k.
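The two operations in Definition 2.4 can be evaluated directly from the per-edge data sets. The sketch below uses hypothetical paths, edges and data items; it only illustrates the intersection and union, not the concrete paths of Fig. 3.

```python
# Illustrative computation of D_{i,j,k} and D_{i,j} from per-edge data sets.
# Path names, edge names and data items are hypothetical.

paths = {                      # each path is the ordered list of its data edges
    "p1": ["e1", "e2"],
    "p2": ["e3", "e4"],
}
edge_data = {                  # Data_e: the items carried by each data edge
    "e1": {"username", "password"},
    "e2": {"username", "id"},
    "e3": {"age"},
    "e4": {"age", "id"},
}

def path_data(path_name):
    """D_{i,j,k}: data that passes through every edge of one path (intersection)."""
    return set.intersection(*(edge_data[e] for e in paths[path_name]))

def pair_data():
    """D_{i,j}: data transferred over any path between n_i and n_j (union)."""
    return set.union(*(path_data(p) for p in paths))

print(path_data("p1"))   # {'username'}
print(pair_data())       # {'username', 'age'}
```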
In addition to the data relation path, there are two other special types of paths, related to cross-site scripting (XSS) and SQL injection. These two types of paths can describe the security properties of Web applications.
SQL injection is a type of vulnerability associated with database scripts. To trigger a SQL injection, a malicious user fills the input controls with malicious SQL fragments and submits the information to the server over the network. The server receives and processes the data, generates SQL statements, and the injected statements are executed against the database, completing a malicious database operation.
In the WRG, the malicious data is submitted from a static page (NP) to a Web service (NS) or a dynamic service page (NSP) through a data edge (ED). These two types of elements integrate the malicious data into SQL statements and inject them into the database. The detailed data transfer mode is shown in Fig. 4. In the WRG, SQL injection can only exist along a SQL injection related path, defined in Definition 2.5.
Definition 2.5. In the WRG, a SQL injection related path P must satisfy the following conditions.
(1) There exists a path segment p' in P whose start node is a static page (NP) and whose end node is the database node (NDB).
(2) p' is a data relation path (DRP); if ns is the start node and ne the end node of p', then Ds,e ≠ Ø.
Definition 2.6. In the WRG, an XSS related path P, whose start node ns and end node ne are both static pages (NP), must satisfy one of the following conditions.
(1) P is a data relation path (DRP) and Ds,e ≠ Ø;
(2) The last edge of P is a generation edge (EG), all other edges are data edges (ED), the last but one node is nc, and Ds,c ≠ Ø.
SQL injection related paths and XSS related paths are together called vulnerability related paths (VRP); vulnerabilities can only exist on these paths.
Here we discuss the scenario depicted in Fig. 6, with the following data transfer: Datae1 = {username, password}, Datae2_in = {username, password}, Datae2_out = {id, age, username}, Datae3 = {username, id, age}, Datae5 = {log_out}. In Fig. 6 there are four complete WRG paths. Among them, path 1 meets Definitions 2.5 and 2.6, path 3 meets Definition 2.5, and the other paths are normal WRG paths.
Testers need to examine the vulnerabilities on the paths that satisfy the conditions of a vulnerability related path. The specific security testing method for vulnerability related paths is discussed in Sect. 3.
We can use the WRG to describe a Web application and its security properties through vulnerability related paths, so the WRG can guide security testing. In this section, we propose a security testing framework based on vulnerability related paths. The framework contains client-side testing and server-side testing, and is shown in Fig. 7.
Definition 3.1. A data form is DF = (INPUTS, BUTTONS, dfs), where INPUTS is the set of input injection points, BUTTONS is the set of event injection points, and dfs = {<vars, button> | vars ⊆ INPUTS, button ∈ BUTTONS}. A pair <vars, button> must meet the following condition: triggering the button's event produces a request, and vars are part of that request.
Algorithm 3.1 is used to create the data form model of a static page; Fig. 8 gives an example. First, fill the input injection points with normal data, then trigger the submit events of the event injection points and analyze the resulting request.
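A minimal sketch of the resulting data form (Definition 3.1) is shown below; the page fields, the button name and the helper method are hypothetical, and Algorithm 3.1 itself derives such a model from real pages and requests.

```python
# Illustrative data-form model of one static page (Definition 3.1).
# The field and button names are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DataForm:
    inputs: set = field(default_factory=set)     # input injection points
    buttons: set = field(default_factory=set)    # event injection points
    dfs: list = field(default_factory=list)      # <vars, button> pairs

    def add_pair(self, variables, button):
        """Record that triggering `button` produces a request carrying `variables`."""
        assert variables <= self.inputs and button in self.buttons
        self.dfs.append((frozenset(variables), button))

login_form = DataForm(inputs={"username", "password"}, buttons={"submit"})
login_form.add_pair({"username", "password"}, "submit")
print(login_form.dfs)
```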
In a Web application, a server element ni is matched with a processing module classi. DTEi denotes the data transferred to ni (ni ∈ NS): DTEi = {<var, classi> | e's end point is ni (e ∈ ED), var ∈ Datae, classi is ni's processing module}. DT denotes the data transfer along a vulnerability related path, DT = ∪i=1..n DTEi.
Definition 3.2. TV is the set of tainted variables and CV is the set of candidate variables, satisfying the following conditions.
(1) CV ⊆ TV, i.e. CV is a subset of TV;
(2) ∀v ∈ CV, variable v is transferred from client to server, and the data in v come from users' inputs.
Definition 3.3. VV is the set of vulnerability variables and must satisfy the following conditions.
(1) VV ⊆ CV, i.e. VV is a subset of CV;
(2) ∀v ∈ VV, variable v is transferred from server to client or to the database;
(3) ∀vi ∈ VV, there are several validation functions f1, ..., fn for vi; the set of these functions is named Chaini.
Chaini is the server-side security mechanism. The object of static analysis is to find VV and the corresponding Chain; dynamic testing then tests the functions in Chain, and its result is the final result of server-side testing.
f: States × Ops → States is the variable state transition map, shown in Fig. 10.
In Fig. 10, when a variable is defined, its state is Nor. The M operation transfers tainted information, so the state of a variable changes to Candi under M. When the state is Candi, the variable is a candidate variable that contains tainted information. The C operation is a cleaning operation, i.e. a security mechanism. In traditional taint analysis, C would change the state back to Nor; however, the security mechanism may not always be effective, so there are two possible transitions, to Nor or staying tainted. We therefore introduce a new state PCandi, which represents this uncertainty. V is the exposing operation: the vulnerability is exposed when a variable in state Candi is used in a V operation.
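The transition map can be written down directly as a table from states and operations to states. The sketch below follows the textual description above; how V behaves in the uncertain state PCandi is our assumption, since only the Candi case is spelled out.

```python
# Illustrative variable-state transition map f: States x Ops -> States.
# States: Nor (normal), Candi (candidate/tainted), PCandi (uncertain), Vul (exposed).
# Ops: M (taint propagation), C (cleaning / security mechanism), V (exposing).

TRANSITIONS = {
    ("Nor",    "M"): "Candi",    # tainted information flows into the variable
    ("Candi",  "M"): "Candi",
    ("Candi",  "C"): "PCandi",   # a sanitizer ran, but it may be ineffective
    ("PCandi", "M"): "Candi",
    ("PCandi", "C"): "PCandi",
    ("Candi",  "V"): "Vul",      # exposing operation on tainted data
    ("PCandi", "V"): "Vul",      # assumption: still potentially exposed
}

def step(state, op):
    """Apply one operation; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, op), state)

state = "Nor"
for op in ("M", "C", "V"):       # taint, (possibly ineffective) clean, expose
    state = step(state, op)
print(state)                     # -> Vul
```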
IFS represents the information flows, IFS = {<num, src, dest, func> | src, dest ∈ Vars, func ∈ Ops}, where num is the number of a flow, src the source variable, dest the destination variable to which the information flows from the source, and func the function involved in the flow. IFS is used to describe the data transfer within server elements and to analyze vulnerabilities. The detailed construction rules follow [4].
VVAM describes the information of a server element. We can extract VV and Chain from VVAM with Algorithm 3.2.
According to the judgment criterion, there are no vulnerabilities on the whole path as long as either side of the path defends correctly before the vulnerability can be triggered.
The server-side module is based on JTB [16], a syntax tree generation tool. Its analysis result is the set of vulnerability variables and the security mechanisms in the server elements. The server-side module contains three child modules: the syntax tree and program structure analysis module, the information flow generation module, and the taint analysis module. Figure 12 shows its structure.
We analyze and test this application with WebTester; the results are shown in Table 6. There are 10 SQL related paths, 11 XSS related paths and 9 other paths in the WRG. The results show that SQL injection vulnerabilities only exist in SQL related paths, and cross-site scripting vulnerabilities only exist in XSS related paths.
5 Conclusion
In this paper, we consider Web security from the perspective of the structure of the Web application. We propose the WRG to describe the structure and security properties of a Web application. We classify the elements of the WRG into two types, client elements and server elements, and propose a security testing framework based on vulnerability related paths. The framework contains client-side testing and server-side testing, and the results of these two parts determine whether there are vulnerabilities on a vulnerability related path. Client-side testing checks the security mechanisms in the client elements; for this we propose the data form model to describe the information of the client elements and apply penetration testing on the basis of data forms. Server-side testing checks the security mechanisms in the server elements; here we use static taint analysis to extract the security mechanisms from the server elements and, in the last step, test them with dynamic testing. Finally, we implement a testing tool named WebTester based on the method described in Sect. 3; WebTester can assist testers in Web security testing. In future work, we will improve the Web relation graph model and hope to describe more types of Web vulnerabilities with it.
References
1. Mookhey, K.K., Burghate, N.: Detection of SQL injection and cross-site scripting attacks
(2004). http://www.securityfocus.com/infocus/1768
2. McGraw, Gary: Software security. IEEE Secur. Priv. 2(2), 80–83 (2004)
3. McDermott, J.P.: Attack net penetration testing. In: Proceedings of the 2000 Workshop on
New Security Paradigms (NSPW 2000), pp. 15–21. Ballycotton, Ireland (2000)
4. Jiye, Z., Xiao-quan, X.: Penetration testing model based on attack graph. Comput. Eng. Des.
26(6), 1516–1518 (2005)
5. Huang, Q., Zeng, Q.-K.: Taint propagation analysis and dynamic verification with
information flow policy. J. Softw. 22(9), 2036–2048 (2011)
6. Lam, M.S., Martin, M.C., Livshits, V.B., Whaley, J.: Securing Web applications with static
and dynamic information flow tracking. In: Proceedings of the 2008. ACM (2008)
7. Hallaraker, O., Vigna, G.: Detecting malicious javascript code in mozilla. In: Proceeding of
10th IEEE International Conference on Engineering of Complex Computer Systems
(ICECCS 2005), pp. 85–94. IEEE, Shanghai, China, CS (2005)
8. Huang, Y.W., Huang, S.K., Lin, T.P.: Web application security assessment by fault injection
and behavior monitoring. In: Proceedings of the 12th International Conference on World
Wide Web, WWW 2003, pp. 148–159 (2003)
9. Ricca, F., Tonella, P.: Analysis and testing of web applications. In: Proceedings of the 23rd
International Conference on Software Engineering, pp. 25–34 (2001)
10. Ricca, F., Tonella, P.: Web site analysis: structure and evolution. In: Proceeding of
International Conference on Software Maintenance, pp. 76–86 (2000)
11. Tonella, P., Ricca, F.: A 2-layer model for the white-box. In: Proceeding of 26th Annual
International Telecommunications Energy Conference on 2004, pp. 11–19 (2004)
12. Ricca, F., Tonella, P.: Using clustering to support the migration from static to dynamic web
pages. In: Proceeding of the 11th IEEE International Workshop on Program Comprehension,
pp. 207–216 (2003)
13. Ricca, F.: Dynamic model extraction and statistical analysis of web application. In:
Proceeding of 4th International workshop on Web Site Evolution, pp. 43–52 (2002)
14. Ricca, F., et al.: Understanding and restructuring web sites with ReWeb. IEEE MultiMedia
Cent. Sci. Technol. Res. 8, 40–51 (2001)
15. Chen, J.F., Wang, Y.D., Zhang, Y.Q.: Automatic generation of attack vectors for
stored-XSS. J. Grad. Univ. Chin. Acad. Sci. 29(6), 815–820 (2012)
16. JTB: the java tree builder homepage [EB/OL] (2000). http://compilers.cs.ucla.edu/jtb/jtb-
2003
An Improved Data Cleaning Algorithm
Based on SNM
1 Introduction
With the rapid development of information technology, more and more enterprises build their own data warehouses, but enterprise databases inevitably contain dirty data caused by data entry errors, spelling errors, missing fields and other factors. Many areas rely on data cleaning technology, such as text mining, search engines, government agencies and enterprise data warehouses. Data cleaning aims to improve data quality by removing data errors and approximately duplicate records from the original data. Approximately duplicate records are records that differ in representation but actually refer to the same entity; if a database contains too many of them, storage space is wasted and the real distribution of the data is distorted. The detection and elimination of approximately duplicate records is therefore one of the most critical issues in data cleaning.
The critical point in detecting and eliminating approximately duplicate records is the match/merge problem, whose most complicated step is deciding whether two records are similar. Because of errors from various sources and different representations of the same entity in different data sources, deciding whether two records match is not a simple arithmetic calculation. Common recognition algorithms include the field matching algorithm, the N-gram matching algorithm, clustering algorithms and the edit distance algorithm. The simplest approach is to traverse all records in the database and compare every pair, but its time complexity is O(n^2), which is unacceptable for large database systems. The sort/merge method is the standard algorithm for detecting approximately duplicate records: it first sorts the data set and then compares adjacent records. Reference [1] assigned each record an N-gram value, sorted the records according to that value, and used priority queues to cluster the records. Reference [2] introduced the concept of a variable sliding window and proposed to reduce missed matches and improve efficiency by adjusting the window size promptly according to the comparison between the similarity value and the threshold. Reference [3] proposed a filtering mechanism: attributes are divided into a sort set Rk and a mode set Rm; the algorithm sorts all records repeatedly according to the attributes in Rk, calls the detection function after each sort, and repeats the process until all approximately duplicate records have been found. Reference [4] adopted a clustering algorithm to gather all probably similar records into a cluster and then compared these records pairwise. Reference [5] used a dependence graph to compute the key attributes of the data table, divided the record set into several small record sets according to the values of the key attribute, and detected approximately duplicate records in each small record set. When merging duplicate records, a master record is kept, the information of the other duplicates is merged into it, and the duplicates are then deleted. This method can be applied to duplicate record detection on large data volumes and can improve detection efficiency and accuracy. Reference [6] proposed a ranking method to assign each attribute an appropriate weight within the duplicate record detection algorithm; detection accuracy is improved by letting the weights influence the similarity calculation between two attributes during record comparison.
All the related work above is based on the sort/merge idea and aims to improve efficiency by reducing the number of record comparisons. In the big data era, however, the records removed from consideration are few compared with the whole data set, so this approach alone does not noticeably improve detection or real-time efficiency. Since each pair of records is compared through one-to-one matching of the corresponding attributes, efficiency can also be improved by reducing unnecessary attribute matching. This paper therefore proposes an improved data cleaning algorithm that uses a sliding window with variable size and speed to avoid missing record comparisons while reducing unnecessary ones, and adopts the cosine similarity algorithm for attribute matching to improve matching accuracy. We also propose a Top-k effective weight filtering algorithm: we first select the Top-k attributes with the highest weights and match them, then compute the weighted sum of the k similarity values and compare it with a threshold to decide whether the remaining attributes need to be matched. If the sum is less than the preset threshold, we continue the comparison; otherwise the comparison can be finished in advance. By reducing the number of attribute comparisons, the real-time efficiency of record detection is greatly improved.
2 SNM Algorithm
We record the similarity between R1 and Rn as Sim(R1,Rn). The initial window size wi equals wmin. We first compare record R1 with the other records in the window; when we reach record Rwmin and Sim(R1,Rwmin) is greater than LowThreshold, we enlarge the window size wi and continue comparing R1 with the next record until Sim(R1,Rn) is less than LowThreshold or wi exceeds wmax. At the same time, we adjust the moving speed of the window during record comparison to improve matching efficiency; the current moving speed is calculated as follows:
efficiency, the current moving speed is calculated as follows:
Sim num
Vi ¼ IntðWi ðWi 1ÞÞ ð1Þ
Wi
From the formula we can see that vi changes linearly with Sim_num, whose value lies between 0 and wi. When Sim_num = wi, the moving speed vi equals 1; when Sim_num = 0, the moving speed equals wi.
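A minimal sketch of Eq. (1), assuming Int(·) is truncation to an integer, makes the two boundary cases explicit:

```python
# Illustrative evaluation of Eq. (1): v_i = Int(w_i - (w_i - 1) * Sim_num / w_i).

def window_speed(w_i, sim_num):
    """Window moving speed: 1 when every record in the window matched, w_i when none did."""
    return int(w_i - (w_i - 1) * sim_num / w_i)

for sim_num in (0, 5, 10):
    print(sim_num, window_speed(10, sim_num))   # 0 -> 10, 5 -> 5, 10 -> 1
```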
cos(φ) = (x1·x2 + y1·y2) / ( √(x1² + y1²) · √(x2² + y2²) )          (2)
We extend the formula to a multidimensional space. Assuming that A and B are two multidimensional vectors, A = (A1, A2, ..., An) and B = (B1, B2, ..., Bn), the cosine between A and B is as follows:
cos(φ) = Σ_{i=1..n} (Ai · Bi) / ( √(Σ_{i=1..n} Ai²) · √(Σ_{i=1..n} Bi²) )          (3)
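As an illustration of Eq. (3), the sketch below splits two attribute values into tokens, builds frequency vectors and computes the cosine of the angle between them; the whitespace tokenisation and the sample strings are our assumptions.

```python
# Illustrative cosine similarity between two attribute values (Eq. (3)).
import math
from collections import Counter

def attribute_similarity(value_a, value_b):
    """Cosine similarity of two attribute strings after simple tokenisation."""
    va, vb = Counter(value_a.split()), Counter(value_b.split())
    terms = set(va) | set(vb)
    dot = sum(va[t] * vb[t] for t in terms)
    norm = math.sqrt(sum(x * x for x in va.values())) * \
           math.sqrt(sum(x * x for x in vb.values()))
    return dot / norm if norm else 0.0

print(attribute_similarity("nanjing china ship", "ship nanjing china"))   # 1.0
print(attribute_similarity("nanjing china ship", "beijing china ship"))
```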
In the detection of approximately duplicate records, the comparison between records depends on the comparison between attributes, so the way attribute values are segmented into strings has an important influence on algorithm efficiency. We apply the cosine similarity algorithm in the detection because of its small time complexity and strong recognition ability. We record the similarity between R1 and R2 as Sim(R1,R2). Each attribute value is segmented into strings, the strings are sorted in alphabetical order and represented as a weighted vector; the similarity between two attributes is then the cosine of the angle between the two vectors, and combining the attribute similarity values with the weights yields Sim(R1,R2). Assuming that R1 and R2 are two records, each attribute i has a weight Wi, n attributes participate in the comparison, Sim(R1,R2) denotes the record similarity and Sim(R1i,R2i) the attribute similarity, the record similarity is computed as follows:
Sim(R1, R2) = Σ_{i=1..n} Valid[i] · Wi · Sim(R1i, R2i) / Σ_{i=1..n} Valid[i] · Wi          (4)
Only when attribute i is non-null in both records, or null in both, is the comparison result for that attribute counted and Valid[i] set to 1; otherwise Valid[i] is set to 0. Lines 1 to 6 of the function fill the array Valid so that null attributes do not distort the result of the comparison. Lines 7 to 12 rank the attributes in descending order of their weights. Lines 13 to 20 calculate the similarity of R1 and R2 as the formula shows; during the comparison, the function first selects the Top-k attributes with the highest weights, because they are most representative of the record similarity. Lines 21 to 31 compare the value KSimilar with the threshold W2: if KSimilar is less than W2, the function continues to calculate the similarity of the remaining attributes to obtain the record similarity; otherwise the process is finished.
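One reasonable reading of the Top-k filtering step is sketched below: attributes are compared in descending weight order, the running weighted sum is a lower bound on Sim(R1,R2), and a pair can be accepted as soon as that bound reaches the threshold W2. The records, weights, the stand-in attribute similarity and the parameter values are placeholders, and the exact stopping rule of the paper's algorithm may differ.

```python
# Sketch of weighted record similarity with Valid[] handling and Top-k early exit.

def record_similarity(r1, r2, weights, attr_sim, k, w2):
    """Return Sim(R1,R2); may stop after the Top-k heaviest attributes."""
    n = len(weights)
    # Valid[i] = 1 only when attribute i is null in both records or in neither.
    valid = [1 if (r1[i] is None) == (r2[i] is None) else 0 for i in range(n)]
    order = sorted(range(n), key=lambda i: weights[i], reverse=True)
    total_weight = sum(weights[i] * valid[i] for i in range(n)) or 1.0

    score = 0.0
    for rank, i in enumerate(order, start=1):
        if valid[i] and r1[i] is not None:
            score += weights[i] * attr_sim(r1[i], r2[i])
        if rank == k and score / total_weight >= w2:
            break                                  # lower bound already suffices
    return score / total_weight

exact = lambda a, b: 1.0 if a == b else 0.0        # stand-in for Eq. (3)
r1 = ("Zhang San", "Nanjing", "1988", None)
r2 = ("Zhang San", "Nanjing", "1989", None)
print(record_similarity(r1, r2, [0.4, 0.3, 0.2, 0.1], exact, k=2, w2=0.6))  # 0.7
```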
There are two evaluation criteria for the approximately duplicate records cleaning
algorithm, the recall rate and the precision rate.
5 Experiment
We compare the improved algorithm with the SNM algorithm in the same operating
environment according to three aspects, the recall rate, the precision rate and execution
time. The experimental operating environment is Inter CPU 2.93 GHz and the memory
is 2 GB, the software environment is as follows: the operating system is Windows7
Ultimate Microsoft, the experimental program is written in Java and the development
environment is Visual Studio Microsoft 2010. The data set for test comes from the
daily monitoring data of a ship management system, due to the condition that the value
of the threshold in SNM algorithm is less than 1 we set LowThreshold to 0.63, during
the running process of the algorithm the window size ranges from 10 to 100.
[Fig. 1: recall rate (%) versus window size (30–90) for data sets of 1,000, 5,000 and 10,000 records; LowThreshold = 0.63]
Figure 1 shows different recall rates with different window sizes and the value of
LowThreshold is 0.63. We can see that the recall rate increases with the increase of the
window size, so the algorithm can identify more duplicate records and improve the
efficiency of the detection.
Figure 2 shows different execution time with different window sizes and the value
of LowThreshold is 0.63. We can see that the execution time’s growth rate is com-
paratively gentle and the algorithm’s performance is stable.
[Fig. 2: execution time (s) versus window size (30–90) for data sets of 1,000, 5,000 and 10,000 records; LowThreshold = 0.63]
[Figure: recall rate (×100%) versus number of data records (10–70 ×10³)]
[Figure: precision rate (×100%) versus number of data records (10–70 ×10³)]
From the results above we can see that, in the improved SNM data cleaning algorithm, the preprocessing of attributes when sorting the data set, together with the cosine similarity algorithm and the Top-k effective weight filtering algorithm used in the attribute similarity calculation, remarkably improves the recall rate and precision rate compared with the SNM algorithm; in addition, the sliding window with variable size and speed clearly improves the execution time of record comparison.
[Figure: execution time versus number of data records (10–70 ×10³)]
6 Conclusion
References
1. Hylton, J.A.: Identifying and merging related bibliographic records. M S dissertation, MIT
Laboratory for Computer Science Technical Report, MIT, p. 678 (1996)
2. Li, J.: Improvement on the algorithm of data cleaning based on SNM. Comput. Appl. Softw.
(2008)
3. He, L., Zhang, Z., Tan, Y., Liao, M.: An efficient data cleaning algorithm based on attributes
selection. In: 2011 6th International Conference on Computer Sciences and Convergence
Information Technology (ICCIT), pp. 375–379. IEEE (2011)
4. Madnick, S.E., Wang, R.Y., Lee, Y.W., Zhu, H.W.: Overview and framework for data and
information quality research. ACM J. Data Inf. Qual. 1, 2 (2009)
5. Omar, B., Hector, G., David, M., Jennifer, W., Steven, E., Su, Q.: Swoosh: a generic
approach to entity resolution. VLDB J. 18, 255–276 (2009)
6. Sotomayor, B.: The globus toolkit 3 programmer’s tutorial, 2004, pp. 81–88, Zugriffsdatum
(2005). http://gdp.globus.org/gt3-tutorial/multiplehtml/
7. Monge, A., Elkan, C.: The field matching problem: algorithms and applications. In:
Proceedings of the 2nd International Conference of Knowledge Discovery and Data Mining
(1996)
8. Krishnamoorthy, R., Kumar, S.S., Neelagund, B.: A new approach for data cleaning process.
In: Recent Advances and Innovations in Engineering (ICRAIE), pp. 1–5. IEEE (2014)
9. Chen, H.Q., Ku, W.S., Wang, H.X., Sun, M.T.: Leveraging spatio-temporal redundancy for
RFID data cleansing. In: Proceedings of the 2010 ACM SIGMOD International Conference
on Management of Data, pp. 51–62 (2010)
10. Zhang, F., Xue, H.F., Xu, D.S., Zhang, Y.H., You, F.: Big data cleaning algorithms in cloud
computing. Int. J. Interact. Mobile Technol. (2013)
11. Arora, R., Pahwa, P., Bansal, S.: Alliance rules for data warehouse cleansing. In:
International Conference on Signal Processing Systems, pp. 743–747. IEEE (2009)
12. Ali, K., Warraich, M.: A framework to implement data cleaning in enterprise data warehouse
for robust data quality. In: 2010 International Conference on Information and Emerging
Technologies (ICIET), pp. 1–6. IEEE (2010). 978-1-4244-8003-6/10
13. Li, J., Zheng, N.: An improved algorithm based on SNM data cleaning algorithm. Comput.
Appl. Softw. 25(2), 245–247 (2008). doi:10.3969/j.issn.1000-386X.2008.02.089
14. Luo, Q., Wang, X.F.: Analysis of data cleaning technology in data warehouse. Comput.
Program. Skills Maintenance 2 (2015)
15. Dai, J.W., Wu, Z.L., Zhu, M.D.: Data Engineering Theory and Technology, pp. 148–155.
National Defense Industry Press, Beijing (2010)
16. Zhang, J.Z., Fang, Z., Xiong, Y.J.: Data cleaning algorithm optimization based on SNM.
J. Central South Univ. (Nat. Sci. Ed.) 41(6), 2240–2245 (2010)
17. Wang, L., Xu, L.D., Bi, Z.M., Xu, Y.C.: Data cleaning for RFID and WSN integration. Ind.
Inf. IEEE Trans. 10(1), 408–418 (2014)
18. Tong, Y.X., Cao, C.C., Zhang, C.J., Li, Y.T., Lei, C.: CrowdCleaner: data cleaning for
multi-version data on the web via crowd sourcing. In: 2014 IEEE 30th International
Conference on Data Engineering (ICDE), pp. 1182–1185. IEEE Computer Society (2014)
19. Volkovs, M., Fei, C., Szlichta, J., Miller, R.J.: Continuous data cleaning. In: 2014 IEEE 30th
International Conference on Data Engineering (ICDE), pp. 244–255. IEEE (2014)
20. Dallachiesa, M., Ebaid, A., Eldawy, A., Elmagarmid, A., Ilyas, I.F., Quzzani, M., Tang, N.:
NADEEF: a commodity data cleaning system. In: Proceedings of the 2013 ACM SIGMOD
International Conference on Management of Data. ACM (2013)
21. Ebaid, A., Elmagarmid, A., Ilyas, I.F., Quzzani, M., Yin, S., Tang, N.: NADEEF: a
generalized data cleaning system. Proc. VLDB Endowment 6, 1218–1221 (2013)
22. Broeck, J.V.D., Fadnes, L.T.: Data cleaning. Epidemiol. Principles Pract. Guidel. 66 (2013)
Enhancing Security of IaaS Cloud
with Fraternal Security Cooperation Between
Cloud Platform and Virtual Platform
Jie Yang(&), Zhiqiang Zhu, Lei Sun, Jingci Zhang, and Xianwei Zhu
1 Introduction
Cloud computing [1–3] is a way to offer on-demand computational resources over the
network in an easy and transparent manner. Thanks to the increasing demand for
Internet services and communication, the extent and importance of cloud computing are
rapidly increasing and obtaining a huge attention in the industrial and scientific com-
munities. Cloud computing comprised of three service models and four deployment
models. These service models are Software as a Service (SaaS), Platform as a Service
(PaaS), and Infrastructure as a Service (IaaS). And, the four deployment models are the
private, community, public and hybrid cloud, that refer to the location of the cloud
infrastructure.
IaaS provides virtual and physical hardware as a service and entire infrastructure is
delivered over the Internet. And it is the most offered cloud service layer by public cloud
providers and also the most used by customers. There are numerous open source cloud
platform that offers building IaaS over Internet, such as OpenStack [4], OpenNebula [5],
Eucalyptus [6], CloudStack [7] and so on. However, plenty of vulnerabilities have been
identified in many of the services they provide, ranging from denial of service to authentication bypass to faulty input validation to malicious insiders. These vulnerabilities are particularly problematic because the cloud platform appears to assume complete trust among services; thus the compromise of any one service may impact the security of the other cloud services. Further, compromised cloud services can launch masses of programs as privileged processes, enabling adversaries to do anything they want on the cloud service hosts.
What’s more, there is an obvious vulnerability that cloud platform executes man-
agement command directly in privileged VM of virtual platform with bare-metal
Hypervisor, such as Xen [8], Hyper-V [9], VMware ESXi [10] and so on. Cloud
customers can straight access the cloud service hosts when they obtain cloud service
(virtual machine, for example), in some service deploy architecture. That is because
cloud platform vendors want to free management server from huge access pressure.
Clearly, this vulnerability exposes the privileged VM to both legitimate users and
adversaries who can access the network. As we all know, once privileged VM is
compromised by adversaries, that means the virtual platform and all the virtual machines
running on this virtual platform are compromised.
Cloud platform vendors do take countermeasures to protect the hosts running cloud services from attack, but practice has proven that current approaches are incomplete. For example, OpenStack requires that requests forwarded to services must first be authorized; however, a request is turned into several interface calls among services, and the safety of these method invocations is not validated. Researchers have also done a lot of work to protect the virtual platform from internal or external attacks. For instance, HyperSafe [11] and HyperSentry [12] both enhance the security of the hypervisor, either by enforcing control-flow integrity or by measuring its integrity dynamically. NOVA [13] is a micro-kernel based hypervisor [14, 15] that disaggregates the traditional monolithic hypervisor into a system of special-purpose components and improves security by enforcing capability-based access control among them. Murray, Milos and Hand propose improving the security of the management software in Xen by moving the domain building utilities into a separate domain. Further, in Xoar [16], the functionality of Dom0 has been separated into nine classes of single-purpose service VMs, each of which contains a piece of functionality decoupled from the monolithic Dom0. However, the protection measures of the cloud platform and the virtual platform are independent of each other and have no consistent security goals, so it is hard to guarantee the security of the IaaS cloud.
In this paper, we propose an alternative approach, called SCIaaS, which introduces security cooperation between the cloud platform and the virtual platform to address privacy and security issues in IaaS. Our approach uses the security features of both the virtual platform and the cloud platform to isolate the privileged VM from the open network environment and to isolate the management network from the Internet. It also protects users' privacy from attacks launched by malicious insiders. In summary, this paper makes the following contributions.
• Using virtual machine isolation to separate the privileged VM from the NetVM, which contains the network components split off from the privileged VM, and building a flow channel between the privileged VM and the NetVM to control, at fine granularity, the control flow coming from the cloud platform. This significantly reduces the attack surface [17] of the privileged VM and isolates the management network from the Internet.
• A set of protection techniques that protect users' privacy against adversaries, even malicious insiders.
• A prototype implementation of our approach that leverages OpenNebula and Xen and is demonstrated to incur only a small performance overhead.
In this paper, the IaaS cloud architecture we discuss is based on OpenNebula and Xen. The remainder of this paper proceeds as follows. Section 2 identifies threats to the IaaS cloud and describes the threat model. Section 3 first discusses our design goals and then describes our approach and the overall architecture of the security-enhanced IaaS cloud. Section 4 describes how to securely control the information flow between the privileged VM and the network VM. Section 5 discusses the protection of virtual machine images. The prototype and performance evaluation results are discussed in Sect. 6. We conclude the paper and outline possible future work in Sect. 7.
This section first describes the attack surface of an IaaS cloud built with OpenNebula and Xen, and then discusses the threat model of SCIaaS.
This section first depicts the design goals of SCIaaS, and then describes approaches to
achieving these goals. Finally, we present the overall architecture of SCIaaS.
any of them places the overall virtual platform in danger. Hence, reducing the attack surface of the privileged VM is important for reducing the security risks of the virtual platform and further enhancing IaaS security.
Every Action with Authentication: It is important to guarantee that every action of cloud users and cloud operators is authenticated. Therefore, not only the cloud platform but also the virtual platform should authenticate users and operators, and the virtual platform must be fully aware of the high-level information about each visitor. Furthermore, the root of authentication should not be provided by the IaaS cloud provider but by a trusted third party.
Privacy Protection: Clearly, privacy is one of the main obstacles to the widespread adoption of the cloud for business-critical computations, because the underlying network, storage and computing infrastructure is controlled by the cloud providers. Hence, privacy protection is an important part of a security-enhanced IaaS cloud.
Security Cooperation: Since the cloud platform and the virtual platform are located in different layers of the IaaS cloud, they also have different security attributes. It is therefore a flawed trust assumption that the virtual platform can simply be trusted by the cloud platform and that only the cloud platform influences IaaS security. Enhancing the security of each of them separately cannot guarantee the security of the IaaS cloud as a whole. Accordingly, we believe the goal can be achieved by integrating the security mechanisms of the cloud platform and the virtual platform.
1
https://en.wikipedia.org/wiki/Attack_surface.
the control flow to the privileged VM through the control flow protocol. What is more, the internal NetVM and the external NetVM are placed in different network segments, which isolates the IaaS cloud from the Internet very well.
Furthermore, SCIaaS uses a trusted third party to provide the authentication service for the IaaS cloud and delegates authentication to the virtual platform where the accessed resources are located. The cloud platform and the virtual platform each use their own security advantages, existing knowledge and access control decisions to enhance their own security and, in turn, the security of the IaaS cloud. In addition, SCIaaS keeps virtual machine images encrypted whenever they are outside the runtime environment of the privileged VM, the hypervisor and the virtual machine.
Cloud operators manage the virtual platform by having the cloud platform send control commands to the privileged VM over the network, while SCIaaS separates the network functionality from the privileged VM. Hence, we build a flow channel between the privileged VM and the NetVM to transmit the control flow, as sketched below.
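The following sketch shows, under our own assumptions, how such a channel could filter control messages before they reach the privileged VM: only messages from authenticated operators carrying whitelisted management commands are forwarded. The command names, message format and policy are hypothetical and do not describe the actual SCIaaS control flow protocol.

```python
# Hypothetical fine-grained filtering of control flow arriving from NetVM.
import json

ALLOWED_COMMANDS = {"vm.create", "vm.shutdown", "vm.migrate", "vm.info"}

def forward_control_message(raw_message, authenticated_operators):
    """Parse a message from NetVM and forward it only if it passes the policy."""
    msg = json.loads(raw_message)
    if msg.get("operator") not in authenticated_operators:
        return ("reject", "operator not authenticated")
    if msg.get("command") not in ALLOWED_COMMANDS:
        return ("reject", "command not allowed on this host")
    return ("forward", msg)                 # hand the command to the privileged VM

print(forward_control_message(
    json.dumps({"operator": "alice", "command": "vm.create", "image": "img-01"}),
    authenticated_operators={"alice"}))
```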
as they don’t have the decryption key, but virtual platform can do that with the help of
cloud user who own that virtual machine. And the encryption and decryption all occur
in privileged VM of virtual platform, the key agreement also happened in privileged
VM. That is because we believe the privileged VM is trusted after network separation.
The virtual machine protection in the lifecycle of virtual machine as follows.
1. Instantiating the virtual machine: First, the privileged VM obtains the encrypted virtual machine image through the NetVM (file access between the privileged VM and the NetVM goes via BLK_Front and BLK_Back). Second, the key management agent obtains the key from the virtual machine owner through the key agreement protocol. Finally, the privileged VM decrypts the image and the virtual machine builder instantiates the virtual machine.
2. Virtual machine running: When the virtual machine reads a file from the image, the privileged VM decrypts the related file; when it writes, the privileged VM encrypts the related file.
3. Destroying the virtual machine: When the cloud user destroys his or her virtual machine, the privileged VM encrypts the virtual machine's state and the plaintext part of the image, and then destroys the key.
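As an illustration of keeping the image encrypted outside the privileged VM's runtime, the sketch below uses Fernet from the Python `cryptography` package as a stand-in for the unspecified cipher; the key agreement with the image owner is abstracted into a locally generated key, and the file names are hypothetical.

```python
# Illustrative encryption/decryption of a VM image in the privileged VM.
from cryptography.fernet import Fernet

def decrypt_image(encrypted_path, plain_path, owner_key):
    """Decrypt an image fetched through NetVM before the VM is instantiated."""
    with open(encrypted_path, "rb") as f:
        data = Fernet(owner_key).decrypt(f.read())
    with open(plain_path, "wb") as f:
        f.write(data)

def encrypt_image(plain_path, encrypted_path, owner_key):
    """Re-encrypt the image (and state) when the VM is destroyed."""
    with open(plain_path, "rb") as f:
        data = Fernet(owner_key).encrypt(f.read())
    with open(encrypted_path, "wb") as f:
        f.write(data)

if __name__ == "__main__":
    key = Fernet.generate_key()      # in SCIaaS the key would come from the owner
    with open("disk.img", "wb") as f:
        f.write(b"toy image contents")
    encrypt_image("disk.img", "disk.img.enc", key)
    decrypt_image("disk.img.enc", "disk.img", key)
```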
Cloud computing offers great potential to improve productivity and reduce costs, but it also poses many new security risks. In this paper, we explore these risks in depth at the junction between the cloud platform and the virtual platform in the IaaS cloud. In particular, we analyze the attack surface of the IaaS cloud and find that the privileged VM of the virtual platform is its focus. We present an approach, called SCIaaS, that addresses these vulnerabilities and argue that it is implementable and efficient. In our approach, we use security cooperation between the cloud platform and the virtual platform to enhance the security of the IaaS cloud. Specifically, we separate the network functionality from the privileged VM and build a secure flow control channel for the control commands transmitted between the cloud platform and the virtual platform. We also protect the virtual machine images from malicious attacks with cryptographic techniques. Finally, we implement a prototype of SCIaaS based on OpenNebula 4.8 and Xen 4.4, and experiments show that the prototype incurs only a small performance overhead.
In the current approach, the security cooperation between the cloud platform and the virtual platform is limited to a number of security mechanisms and is not yet aligned with the security requirements of the whole system. In the future, we expect to bring security policy into both the cloud platform and the virtual platform under a unified standard.
References
1. Armbrust, M., Fox, A., Griffith, R., Joseph, A.D., Katz, R., Konwinski, A., et al.: A view of
cloud computing. Commun. ACM 53(4), 50–58 (2010)
2. NIST, NIST: The NIST definition of cloud computing. Commun. ACM 53(6), 50–50 (2011)
3. Wei, L., Zhu, H., Cao, Z., Dong, X., Jia, W., Chen, Y., et al.: Security and privacy for
storage and computation in cloud computing. Inf. Sci. 258(3), 371–386 (2014)
4. Corradi, A., Fanelli, M., Foschini, L.: VM consolidation: a real case based on openstack
cloud. Future Gener. Comput. Syst. 32(2), 118–127 (2014)
5. Milojičić, D., Llorente, I.M., Montero, R.S.: Opennebula: a cloud management tool. IEEE
Internet Comput. 15(2), 11–14 (2011)
6. Sempolinski, P., Thain, D.: A comparison and critique of eucalyptus, OpenNebula and
Nimbus. In: 2010 IEEE Second International Conference on Cloud Computing Technology
and Science (CloudCom), pp. 417–426. IEEE (2010)
7. Paradowski, A., Liu, L., Yuan, B.: Benchmarking the performance of OpenStack and
CloudStack. In: 2014 IEEE 17th International Symposium on Object/Component/Service-
Oriented Real-Time Distributed Computing (ISORC), pp. 405–412. IEEE Computer Society
(2014)
8. Barham, P., Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., et al.: Xen and the art of
virtualization. In: Proceedings of SOSP-03: The Nineteenth ACM Symposium on Operating
Systems Principles, vol. 19, pp. 164–177. ACM, New York, NY (2003)
9. Leinenbach, D., Santen, T.: Verifying the Microsoft hyper-V hypervisor with VCC. In:
Cavalcanti, A., Dams, D.R. (eds.) FM 2009. LNCS, vol. 5850, pp. 806–809. Springer,
Heidelberg (2009)
10. Tian, J.W., Liu, X.X., Xi, L.I., Wen-Hui, Q.I.: Application on VMware Esxi virtualization
technique in server resource integration. Hunan Electr. Power 6, 004 (2012)
11. Wang, Z., Jiang, X.: HyperSafe: a lightweight approach to provide lifetime hypervisor
control-flow integrity. In: Proceedings of S&P, Oakland, pp. 380–395 (2010)
12. Azab, A.M., Ning, P., Wang, Z., Jiang, X., Zhang, X., Skalsky, N.C.: HyperSentry: enabling
stealthy in-context measurement of hypervisor integrity 65. In: Proceedings of the 17th
ACM Conference on Computer and Communications Security, pp. 38–49. ACM (2010)
13. Steinberg, U., Kauer, B.: NOVA: a microhypervisor-based secure virtualization architecture.
In: Proceedings of the European Conference on Computer Systems, pp. 209–222 (2010)
14. Dall, C., Nieh, J.: KVM/ARM: the design and implementation of the Linux arm hypervisor.
In: Proceedings of International Conference on Architectural Support for Programming
Languages and Operating Systems, vol. 42, pp. 333–348 (2014)
15. Seshadri, A., Luk, M., Qu, N., Perrig, A.: SecVisor: a tiny hypervisor to provide lifetime
kernel code integrity for commodity oses. SOSP 41(6), 335–350 (2007)
16. Colp, P., Nanavati, M., Zhu, J., Aiello, W., Coker, G., Deegan, T., et al.: Breaking up is hard
to do: security and functionality in a commodity hypervisor. In: Proceedings of ACM
Symposium on Operating Systems Principles, pp. 189–202 (2011)
17. Manadhata, P.K., Wing, J.M.: An attack surface metric. IEEE Trans. Softw. Eng. 37(3),
371–386 (2011)
18. Shepler, S., Callaghan, B., Robinson, D., Thurlow, R., Beame, C., Eisler, M., et al.: NFS
version 4 protocol (2000); Ousterhout, J., et al.: Caching in the Sprite network file system.
ACM Trans. Comput. Syst. 6(1)
19. Hitz, D., Lau, J., Malcolm, M.: File system design for an NFS file server appliance. In:
USENIX Technical Conference, vol. 1 (1994)
20. Wada, K.: Redundant arrays of independent disks. In: Liu, L., Özsu, T. (eds.) Encyclopaedia
of Database Systems. Springer, New York (2009)
21. Savage, S., Wilkes, J.: AFRAID - a frequently redundant array of independent disks. Parity 2,
5 (1996)
Cloud Platform
Cluster Analysis by Variance Ratio Criterion
and Quantum-Behaved PSO
1 Introduction
The aim of cluster analysis is to group a set of objects on the principle that objects in one cluster are more similar to each other than to those in other clusters. Cluster analysis belongs to unsupervised learning and is a popular technique for the statistical analysis of data in applications such as case-based reasoning [1], prediction [2], commerce and finance [3], identification, quantitative description [4], and pattern recognition [5].
So far, many experts and scholars have studied cluster analysis and produced a number of efficient algorithms [6]: (i) Connectivity-based clustering: these methods are founded on the idea that objects are more similar to nearby objects than to objects farther away. (ii) Center-based
clustering: clusters are represented by central vectors. The well-known k-means clustering turns the problem into a minimization problem [7]: find the k cluster centers and assign each instance to the closest center. (iii) Distribution-based clustering: objects that presumably belong to the same distribution form a cluster; such methods effectively assume that the data were generated by sampling random objects from a set of distributions. (iv) Density-based clustering: areas of higher density than the remainder of the data set are defined as clusters, while objects in sparse areas are regarded as noise or border points, as in DBSCAN.
In this paper, we focus on center-based methods, which include two typical algorithms: (1) fuzzy c-means clustering (FCM) and (2) k-means clustering. These algorithms obtain the solution by iteration, which depends on the initial partition; if an improper initial partition is selected, the result can be trapped in a local minimum and fail to reach the global minimum.
To solve the above problem, the branch and bound algorithm was used to find the globally optimal clustering [8]; however, its computation time is high, which impedes its use. In the last decade, researchers have proposed evolutionary algorithms for clustering, since they are insensitive to initial settings and capable of jumping out of local minima. For instance, a clustering approach based on the combinatorial particle swarm algorithm was proposed by Jarboui et al. [9]. Gelbard et al. [10] studied cross-cultural research data and applied the multi-algorithm voting (MAV) method for cluster analysis. Considering that the k-means algorithm depends closely on the initial population and converges to local optima, Niknam and Amiri [11] presented a new hybrid evolutionary algorithm named FAPSO-ACO-K, which combines ant colony optimization (ACO), the k-means algorithm, and fuzzy adaptive particle swarm optimization (FAPSO). Zhang et al. [12] proposed a chaotic artificial bee colony (CABC) algorithm to solve the partitional clustering problem. Abul Hasan and Ramakrishnan [13] argued that robust and flexible optimization techniques are needed to produce good clustering results, and provided a survey of combined evolutionary algorithms for cluster analysis. A new dynamic clustering approach based on particle swarm optimization and the genetic algorithm (DCPG) was proposed by Kuo et al. [14]. Zhang and Li [15] employed the firefly algorithm (FA) and tested it on 400 data items in 4 groups. Yang et al. [16] proposed and explored exemplar-based clustering analysis optimized by genetic algorithms (GA). Wan [17] combined k-means clustering analysis (K) with particle swarm optimization (PSO), named KPSO, to build landslide susceptibility maps. A multi-objective optimization (MOO) method combined with cluster analysis was proposed by Palaparthi et al. [18] to study the form-function relation of the vocal folds; NSGA-II (an evolutionary algorithm) was used to couple MOO with a finite element model of the laryngeal sound source. Zhang et al. [19] proposed weighted k-means clustering based on an improved GA. Ozturk et al. [20] proposed a dynamic clustering method using an improved binary artificial bee colony algorithm.
However, the above algorithms have the following shortcomings: convergence is too slow, or the search is trapped in local minima, which can lead to a wrong solution. In this paper, we present the quantum-behaved PSO (QPSO) algorithm for this optimization.
The remainder of this article is organized as follows. Section 2 defines the clustering model and provides the encoding strategy and clustering criterion. Section 3 introduces the mechanism of QPSO. Section 4 presents the experiments, which use three categories of manually generated data with different degrees of overlap. Finally, Sect. 5 draws conclusions and outlines future work.
2 Model Definition
We depict the problem of partitional clustering in this section. Suppose there are n samples B = {b1, b2, ..., bn} in a d-dimensional metric space, to be divided into k groups. Each bi ∈ R^d is a feature vector with d real-valued measures of the given object. The clusters can be written as U = {u1, u2, ..., uk}, and they must comply with the following principles:

$u_m \neq \emptyset \quad (m = 1, 2, 3, \ldots, k)$
$u_m \cap u_v = \emptyset \quad (m \neq v)$    (1)
$\bigcup_{m=1}^{k} u_m = \{1, 2, 3, \ldots, n\}$

The optimization problem lies in finding the optimal partition U*, which performs best among all possible solutions. In order to translate the cluster analysis task into an optimization problem, two related issues need to be solved: (1) the encoding strategy; (2) the criterion function.
As an example of the encoding strategy, a partition of nine objects into three clusters and its encoding are shown below.

Cluster  Objects
One      (1, 5, 7)
Two      (2, 6, 8)
Three    (3, 4, 9)

Object   x1 x2 x3 x4 x5 x6 x7 x8 x9
Label X  1  2  3  3  1  2  1  2  3
$\mathrm{VARAC} = \dfrac{X_2}{X_1} \cdot \dfrac{n-k}{k-1}$    (2)
Here X1 stands for the within-cluster variation and X2 for the between-cluster variation. They are defined as:
$X_1 = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(b_{ij} - \bar{b}_j\right)^{T} \left(b_{ij} - \bar{b}_j\right)$    (3)

$X_2 = \sum_{j=1}^{k} n_j \left(\bar{b}_j - \bar{b}\right)^{T} \left(\bar{b}_j - \bar{b}\right)$    (4)
in which nj denotes the cardinality of cluster cj, bij denotes the i-th object assigned to cluster cj, $\bar{b}_j$ stands for the sample mean of the j-th cluster (the cluster center), and $\bar{b}$ stands for the overall sample mean (the data center). (n−k) is the degree of freedom of the within-cluster variation, and (k−1) is the degree of freedom of the between-cluster variation.
Compact clusters are expected to have small values of X1 and well-separated clusters to have large values of X2; hence, a desirable data partition leads to a larger value of VARAC. To use VARAC as a maximization criterion, the normalization term (n−k)/(k−1) is introduced, which prevents the ratio from increasing monotonically with the number of clusters.
3 Optimization Algorithm
3.1 PSO
PSO uses a group of particles, updated over iterations, to perform the search [21]. To seek the optimal solution, each particle keeps track of two kinds of best positions: (1) its previously best (pb) position; (2) the global best (gb) position in the swarm [22]. Here, i denotes the index of a particle in the population and runs from 1 to N, the population size, i.e., the number of particles; t is the index of the current iteration; f is the fitness function, i.e., the objective function of the optimization problem; and P is the position of a particle. PSO uses the two equations (7) and (8) to update the two characteristics of every particle, the velocity (T) and the position (P):

$T_i(t+1) = \omega T_i(t) + c_1 q_1 \left(pb_i(t) - P_i(t)\right) + c_2 q_2 \left(gb(t) - P_i(t)\right)$    (7)

$P_i(t+1) = T_i(t+1) + P_i(t)$    (8)
Here ω stands for the inertia weight, which is employed to balance global exploration and local exploitation of the fitness function; q1 and q2 are random variables uniformly distributed in the range [0, 1]; and c1 and c2 are acceleration coefficients, which are positive constants.
It is necessary to set an upper bound for the velocity (T). We used the velocity clamping technique [23] to prevent particles from leaving the solution space, which also helps keep an appropriate degree of dispersion of the particles within the solution space [24].
3.2 QPSO
The main shortcoming of PSO is its tendency to fall into local optima, despite its excellent convergence performance [25]. Inspired by trajectory analysis and quantum mechanics, the quantum-behaved PSO (QPSO) was developed.
QPSO treats the swarm as a quantum-like system, in which every particle is assumed to be spinless and to exhibit quantum behavior described by a wave function (WF), without taking into account interference from other particles. Further, each individual particle is assumed to move towards a potentially better region of the search space. van den Bergh and Engelbrecht [26] showed that each particle converges to its local attractor ai, defined as
$a_i = \dfrac{c_1\, pbest(i) + c_2\, gbest}{c_1 + c_2}$    (9)
In QPSO, the search space and the solution space have different natures. The wave function (or probability density function) of a particle describes the particle state in the quantized search space, but it does not directly provide the particle position, which is required to evaluate the fitness value. Therefore, it is necessary to perform a state transformation between the two spaces [27].
Collapse, which is used to evaluate the particle position, denotes the transformation from the quantum state to the classical state. Based on the Monte Carlo method, the particle position is expressed as:
$P_i(t+1) = \begin{cases} a_i(t) + \dfrac{L_i(t)}{2}\,\ln\!\left(\dfrac{1}{z_1}\right), & z_2 > 0.5 \\[6pt] a_i(t) - \dfrac{L_i(t)}{2}\,\ln\!\left(\dfrac{1}{z_1}\right), & \text{otherwise} \end{cases}$    (10)
where z1 and z2 are random numbers drawn from the uniform distribution on (0, 1). L is closely related to the energy intensity of the potential well and specifies the search range of a particle, which is often called the creativity or imagination of the particle. Its formulation is

$L_i = 2\vartheta \left| mb - P_i \right|$    (11)
Formula (12) serves as the iterative function of the particle positions of the
QPSO [28].
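A minimal sketch of the QPSO position update following Eqs. (9)-(11) is shown below; the contraction-expansion coefficient theta and the use of the mean best position mb are assumptions about details not fully reproduced in the excerpt above.

```python
import numpy as np

def qpso_step(P, pbest, gbest, theta=0.75):
    """One QPSO iteration over a population of particles (rows of P)."""
    N, d = P.shape
    mb = pbest.mean(axis=0)                       # mean best position (assumed)
    c1, c2 = np.random.rand(N, 1), np.random.rand(N, 1)
    a = (c1 * pbest + c2 * gbest) / (c1 + c2)     # local attractor, Eq. (9)
    L = 2.0 * theta * np.abs(mb - P)              # Eq. (11)
    z1 = np.random.rand(N, d)
    z2 = np.random.rand(N, d)
    sign = np.where(z2 > 0.5, 1.0, -1.0)          # branch of Eq. (10)
    return a + sign * (L / 2.0) * np.log(1.0 / z1)
```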
The simulations were run on an HP laptop with a 3.2 GHz processor, 16 GB of RAM, and the Windows 8 operating system. The program code was developed in-house in 64-bit Matlab 2015a (The MathWorks), a high-level numerical computing environment and fourth-generation programming language.
Table 1. Simulation results for four-hundred artificial data based on twenty runs
Overlapping degree VARAC GA [16] CPSO [9] FA [15] QPSO (Proposed)
NO Best 1683.2 1683.2 1683.2 1683.2
Mean 1321.3 1534.6 1683.2 1683.2
Worst 451.0 1023.9 1683.2 1683.2
PO Best 620.5 620.5 620.5 620.5
Mean 594.4 607.9 618.2 618.7
Worst 512.8 574.1 573.3 589.1
IO Best 275.6 275.6 275.6 275.6
Mean 184.1 203.8 221.5 238.7
Worst 129.0 143.5 133.9 151.2
From Table 1, we find that for the NO instances all four algorithms reach the optimal VARAC of 1683.2 at least once; in addition, both FA and QPSO succeed in all 20 runs.
For the PO data, all four algorithms (GA, CPSO, FA, and QPSO) obtain the optimal VARAC of 620.5 within 20 runs. The average VARAC values of these algorithms are 594.4, 607.9, 618.2, and 618.7, respectively, and the worst values are 512.8, 574.1, 573.3, and 589.1. This suggests that QPSO performs better than GA, CPSO, and FA.
For the IO data, GA, CPSO, FA, and QPSO all find the best VARAC of 275.6. Again, the average and worst VARACs of QPSO are 238.7 and 151.2, which are better than those of GA (184.1 and 129.0), CPSO (203.8 and 143.5), and FA (221.5 and 133.9). This further validates the effectiveness of QPSO.
References
1. Zhu, G.N., Hu, J., Qi, J., Ma, J., Peng, Y.H.: An integrated feature selection and cluster
analysis techniques for case-based reasoning. Eng. Appl. Artif. Intell. 39, 14–22 (2015)
2. Amirian, E., Leung, J.Y., Zanon, S., Dzurman, P.: Integrated cluster analysis and artificial
neural network modeling for steam-assisted gravity drainage performance prediction in
heterogeneous reservoirs. Expert Syst. Appl. 42, 723–740 (2015)
3. Tsai, C.F.: Combining cluster analysis with classifier ensembles to predict financial distress.
Inf. Fusion 16, 46–58 (2014)
4. Klepaczko, A., Kocinski, M., Materka, A.: Quantitative description of 3D vascularity
images: texture-based approach and its verification through cluster analysis. Pattern Anal.
Appl. 14, 415–424 (2011)
5. Illarionov, E., Sokoloff, D., Arlt, R., Khlystova, A.: Cluster analysis for pattern recognition
in solar butterfly diagrams. Astro. Nachr. 332, 590–596 (2011)
6. Wang, S., Ji, G., Dong, Z., Zhang, Y.: An improved quality guided phase unwrapping
method and its applications to MRI. Prog. Electromagnet. Res. 145, 273–286 (2014)
7. Ayech, M.W., Ziou, D.: Segmentation of Terahertz imaging using k-means clustering based
on ranked set sampling. Expert Syst. Appl. 42, 2959–2974 (2015)
8. Trespalacios, F., Grossmann, I.E.: Algorithmic approach for improved mixed-integer
reformulations of convex generalized disjunctive programs. INFORMS J. Comput. 27,
59–74 (2015)
9. Jarboui, B., Cheikh, M., Siarry, P., Rebai, A.: Combinatorial particle swarm optimization
(CPSO) for partitional clustering problem. Appl. Math. Comput. 192, 337–345 (2007)
10. Gelbard, R., Carmeli, A., Bittmann, R.M., Ronen, S.: Cluster analysis using multi-algorithm
voting in cross-cultural studies. Expert Syst. Appl. 36, 10438–10446 (2009)
11. Niknam, T., Amiri, B.: An efficient hybrid approach based on PSO, ACO and k-means for
cluster analysis. Appl. Soft Comput. 10, 183–197 (2010)
12. Zhang, Y., Wu, L., Wang, S., Huo, Y.: Chaotic artificial bee colony used for cluster analysis.
In: Chen, R. (ed.) ICICIS 2011 Part I. CCIS, vol. 134, pp. 205–211. Springer, Heidelberg
(2011)
13. Abul Hasan, M.J., Ramakrishnan, S.: A survey: hybrid evolutionary algorithms for cluster
analysis. Artif. Intell. Rev. 36, 179–204 (2011)
14. Kuo, R.J., Syu, Y.J., Chen, Z.Y., Tien, F.C.: Integration of particle swarm optimization and
genetic algorithm for dynamic clustering. Inf. Sci. 195, 124–140 (2012)
15. Zhang, Y., Li, D.: Cluster analysis by variance ratio criterion and firefly algorithm. JDCTA:
Int. J. Digit. Content Technol. Appl. 7, 689–697 (2013)
16. Yang, Z., Wang, L.T., Fan, K.F., Lai, Y.X.: Exemplar-based clustering analysis optimized
by genetic algorithm. Chin. J. Electron. 22, 735–740 (2013)
17. Wan, S.A.: Entropy-based particle swarm optimization with clustering analysis on landslide
susceptibility mapping. Environ. Earth Sci. 68, 1349–1366 (2013)
18. Palaparthi, A., Riede, T., Titze, I.R.: Combining multiobjective optimization and cluster
analysis to study vocal fold functional morphology. IEEE Trans. Biomed. Eng. 61, 2199–
2208 (2014)
19. Zhang, T.J., Cao, Y., Mu, X.W.: Weighted k-means clustering analysis based on improved
genetic algorithm. Sens. Mechatron. Autom. 511–512, 904–908 (2014)
20. Ozturk, C., Hancer, E., Karaboga, D.: Dynamic clustering with improved binary artificial
bee colony algorithm. Appl. Soft Comput. 28, 69–80 (2015)
21. Ghamisi, P., Benediktsson, J.A.: Feature selection based on hybridization of genetic
algorithm and particle swarm optimization. IEEE Geosci. Remote Sens. Lett. 12, 309–313
(2015)
22. Zhang, Y., Wang, S., Phillips, P., Ji, G.: Binary PSO with mutation operator for feature
selection using decision tree applied to spam detection. Knowl.-Based Syst. 64, 22–31
(2014)
23. Wang, S., Dong, Z.: Classification of Alzheimer disease based on structural magnetic
resonance imaging by kernel support vector machine decision tree. Prog. Electromag. Res.
144, 171–184 (2014)
24. Wang, S., Zhang, Y., Dong, Z., Du, S., Ji, G., Yan, J., Yang, J., Wang, Q., Feng, C., Phillips,
P.: Feed-forward neural network optimized by hybridization of PSO and ABC for abnormal
brain detection. Int. J. Imaging Syst. Technol. 25, 153–164 (2015)
25. Lin, L., Guo, F., Xie, X.L., Luo, B.: Novel adaptive hybrid rule network based on TS fuzzy
rules using an improved quantum-behaved particle swarm optimization. Neurocomputing
149, 1003–1013 (2015)
26. van den Bergh, F., Engelbrecht, A.P.: A study of particle swarm optimization particle
trajectories. Inf. Sci. 176, 937–971 (2006)
27. Davoodi, E., Hagh, M.T., Zadeh, S.G.: A hybrid improved quantum-behaved particle swarm
optimization-simplex method (IQPSOS) to solve power system load flow problems. Appl.
Soft Comput. 21, 171–179 (2014)
28. Fu, X., Liu, W.S., Zhang, B., Deng, H.: Quantum behaved particle swarm optimization
with neighborhood search for numerical optimization. Math. Prob. Eng. 2013, 10 (2013).
doi:10.1155/2013/469723
29. Zhang, Y., Dong, Z., Wang, S., Ji, G., Yang, J.: Preclinical diagnosis of magnetic resonance
(MR) brain images via discrete wavelet packet transform with tsallis entropy and generalized
eigenvalue proximal support vector machine (GEPSVM). Entropy 17, 1795–1813 (2015)
30. Zhang, Y., Wang, S., Phillips, P., Dong, Z., Ji, G., Yang, J.: Detection of Alzheimer’s
disease and mild cognitive impairment based on structural volumetric MR images using
3D-DWT and WTA-KSVM trained by PSOTVAC. Biomed. Signal Process. Control 21,
58–73 (2015)
Failure Modes and Effects Analysis Using
Multi-factors Comprehensive Weighted Fuzzy
TOPSIS
1 Introduction
Failure mode and effect analysis (FMEA) has become an indispensable tool in the safety and reliability analysis of products and processes. The literature [1] integrated FMEA with the FAHP method, combined with experts' evaluation data, to calculate the risk level of construction projects and rank the failure modes. In [2], FMEA and AHP are used in the detection and maintenance of large cranes to evaluate each component's importance and determine priorities. The traditional FMEA method sorts all failure modes based on the risk priority number (RPN). The RPN calculation evaluates the risk factors occurrence (O), severity (S), and detection (D) of a failure mode through different intermediate scores according to the actual application situations in different companies and industries. However, the risk factors are difficult to estimate precisely in the real world, and traditional FMEA has many defects in the evaluation standard, the computation of the RPN, the weights of the risk factors, etc.
In view of the above defects, several methods have been put forward to overcome the drawbacks of traditional FMEA. The literature [3] analyzed failure modes from a multi-attribute perspective. The literature [4] used fuzzy rule interpolation and reduction techniques to establish a fuzzy FMEA model. Considering the flexible evaluation structure of TOPSIS, Sachdeva et al. [5] applied crisp TOPSIS to the FMEA approach and used only Shannon's entropy concept to assign objective weights to the risk factors. Sachdeva et al. [6] also applied this method to a digester of a paper mill. In fact, crisp TOPSIS is not well suited to FMEA, even though it has a reasonable and flexible structure, because the risk factors are difficult to estimate precisely in practice. Moreover, objective weights based only on Shannon's entropy cannot fully reflect the importance of the risk factors, because they ignore the experts' knowledge. In view of these defects, many scholars use fuzzy TOPSIS instead of crisp TOPSIS analysis. The literature [7] provides a two-phase framework consisting of fuzzy entropy and fuzzy TOPSIS to improve the decision quality of FMEA risk analysis, but the fuzzy entropy-based weights still cannot reflect the experts' knowledge and experience well. The literature [8, 9] applies TOPSIS to performance and risk evaluation.
We put forward a more reasonable, more accurate, and more flexible FMEA method based on fuzzy TOPSIS and a comprehensive weighting method. The fuzzy weighted TOPSIS approach not only benefits from experts' knowledge and experience but also makes full use of the intrinsic information in the evaluation process. In addition, each expert is assigned a weight according to his or her research area.
Section 2 of this paper introduces the new FMEA based on fuzzy TOPSIS. A case of metro door system component failures is presented in Sect. 3 to illustrate the feasibility of the proposed method. The fourth section summarizes the paper and discusses the advantages and limitations of the proposed method.
The fuzzy TOPSIS model was established by Chen and Hwang [10]. Subsequently, Chen [11] defined the Euclidean distance between two fuzzy numbers, which laid a good foundation for the application of fuzzy TOPSIS in the field of FMEA. The fuzzy TOPSIS method is the principal line of this paper. The structural framework of the new FMEA method is shown in Fig. 1.
$\tilde{x}_{ij} = \frac{1}{k}\left[\tilde{x}_{ij}^{1} + \tilde{x}_{ij}^{2} + \cdots + \tilde{x}_{ij}^{k}\right]$    (1)

where $\tilde{x}_{ij}^{k}$ is the rating of the i-th FM with respect to the j-th RF given by the k-th expert.
Then, all the linguistic variables are converted into trapezoidal fuzzy numbers according to Table 1. Next, the defuzzification method proposed by Chen and Klien [12] is used to obtain crisp values; the formula is as follows:
$P_{ij} = \dfrac{\sum_{i=0}^{n} (b_i - c)}{\sum_{i=0}^{n} (b_i - c) - \sum_{i=0}^{n} (a_i - d)}$    (3)
$\tilde{w}_j = \frac{1}{k}\left[\tilde{w}_j^{1} + \tilde{w}_j^{2} + \cdots + \tilde{w}_j^{k}\right]$    (4)
where $\tilde{w}_j^{k}$ represents the importance of the j-th risk factor according to the k-th expert's evaluation, and L represents the number of values defining a fuzzy number (for example, a triangular fuzzy number consists of 3 values and a trapezoidal fuzzy number of 4 values).
According to Eqs. (4) and (5), the fuzzy subjective weight is converted into a crisp value wsj. Assuming a triangular fuzzy number M = (m1, m2, m3), its defuzzified value is expressed as:
$P(\tilde{M}) = \dfrac{m_1 + 4 m_2 + m_3}{6}$    (6)
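The aggregation of the experts' fuzzy weights in Eq. (4) and the defuzzification in Eq. (6) can be sketched as follows, under the assumption that the subjective weights are given as triangular fuzzy numbers; the function name and the example values are hypothetical.

```python
import numpy as np

def subjective_weight(expert_weights):
    """Aggregate experts' triangular fuzzy weights (Eq. 4) and defuzzify (Eq. 6).

    expert_weights: list of (m1, m2, m3) triangular fuzzy numbers, one per expert.
    """
    m = np.mean(np.asarray(expert_weights, dtype=float), axis=0)   # fuzzy average
    m1, m2, m3 = m
    return (m1 + 4.0 * m2 + m3) / 6.0                               # crisp value

# e.g. three experts rating the importance of one risk factor
print(subjective_weight([(0.5, 0.7, 0.9), (0.7, 0.9, 1.0), (0.3, 0.5, 0.7)]))
```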
$r_{ij} = \dfrac{x_{ij}}{\sum_{i=1}^{m} x_{ij}}$    (7)
② Calculate the entropy value of each risk factor using the entropy weight method. Let ej denote the entropy of the j-th factor:
$e_j = -\dfrac{1}{\ln m}\sum_{i=1}^{m} r_{ij} \ln r_{ij}$    (8)
$w_{oj} = \dfrac{1 - e_j}{\sum_{j=1}^{n} \left(1 - e_j\right)}$    (9)
In this formula, the greater (1 − ej) is, the more important the j-th risk factor is.
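A minimal sketch of the objective weighting in Eqs. (7)-(9) is given below; the small epsilon used to avoid log(0) and the function name are our own assumptions.

```python
import numpy as np

def entropy_weights(X, eps=1e-12):
    """Objective weights of risk factors by the entropy method (Eqs. 7-9).

    X: (m, n) matrix of crisp ratings, m failure modes x n risk factors.
    """
    m, n = X.shape
    r = X / X.sum(axis=0)                               # Eq. (7): column-wise normalization
    e = -(r * np.log(r + eps)).sum(axis=0) / np.log(m)  # Eq. (8): entropy of each factor
    d = 1.0 - e                                          # divergence degree
    return d / d.sum()                                   # Eq. (9): normalized objective weights
```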
Step 3: Integration of risk factor weights.
To amplify the differences between the weights of the RFs, the multiplicative combination weighting method is adopted: the objective weight and the subjective weight are multiplied first, and the product is then normalized to obtain the comprehensive weight.
$w_j = \dfrac{w_{sj}\, w_{oj}}{\sum_{j=1}^{n} w_{sj}\, w_{oj}}$    (10)
where wsj represents the subjective weight, woj the objective weight, and wj the combined weight.
$d_i^{+} = \sum_{j=1}^{n} d\!\left(\tilde{v}_{ij}, \tilde{v}_j^{+}\right)$    (16)

$d_i^{-} = \sum_{j=1}^{n} d\!\left(\tilde{v}_{ij}, \tilde{v}_j^{-}\right)$    (17)
The distance of two fuzzy numbers will be calculated with the following formula:
$d(\tilde{p}, \tilde{q}) = \sqrt{\dfrac{1}{4}\left[(p_1 - q_1)^2 + (p_2 - q_2)^2 + (p_3 - q_3)^2 + (p_4 - q_4)^2\right]}$    (18)

where $\tilde{p} = (p_1, p_2, p_3, p_4)$ and $\tilde{q} = (q_1, q_2, q_3, q_4)$ are two trapezoidal fuzzy numbers.
The closeness coefficients of all failure modes for ranking order are obtained as
follows:
$C_i = \dfrac{d_i^{-}}{d_i^{+} + d_i^{-}}$    (19)
The closer an FM is to the positive ideal solution and the farther it is from the negative ideal solution, the closer Ci is to 1. Therefore, managers can select the FMs with the greatest risk based on the closeness coefficients when arranging improvement resources.
3 Case Analysis
This paper uses the reliability analysis of metro door system failure modes from the literature [14] as an example to illustrate the improved fuzzy TOPSIS method.
3.1 Background
With the rapid growth of the urban population, the subway plays an increasingly important role in people's lives. The reliability of the door system is related to the passengers' personal safety and the normal operation of the metro trains, so it is very important to carry out risk and reliability analysis of the subway door system in order to find the weak links in design and maintenance.
The door system is mainly composed of electrical control components, the guiding device, the foundation bearing device, the locking device, and the driver subsystem. The operating principle of the metro door system is shown in Fig. 2.
Assume that the FMEA analysis team consists of 5 experts from different departments and that the 5 members are given different relative weights. The steps of the FMEA analysis using the comprehensive weighted fuzzy TOPSIS are as follows:
(1) Construct the evaluation matrix.
The experts use the linguistic terms (shown in Table 1) to evaluate the ratings of the failure modes with respect to the risk factors, as presented in Table 4. These linguistic evaluations are transformed into trapezoidal fuzzy numbers, and the fuzzy failure mode evaluation matrix is then constructed, as shown in Table 5.
② Objective weights. Formula (7) is used to normalize the fuzzy failure mode evaluation matrix, giving the standard value rij of each failure mode (shown in Table 7). Then, the entropy ej and the objective weight woj are calculated using formulas (8)–(9) (shown in Table 8).
Table 8. The ej, 1-ej and objective entropy weight of risk factors
S O D
ej 0.863 0.845 0.861
1-ej 0.136 0.154 0.138
woj 0.317 0.360 0.321
Table 9. The fuzzy normalized evaluation matrix and comprehensive weight for S, O and D
No. S O D
ws = 0.448 wo = 0.362 wd = 0.189
FM1 (0.148,0.255,0.383,0.489) (0.243,0.378,0.567,0.702) (0.297,0.432,0.648,0.783)
FM2 (0.234,0.34,0.51,0.617) (0.189,0.324,0.486,0.621) (0.243,0.378,0.567,0.702)
FM3 (0.148,0.255,0.383,0.489) (0.243,0.378,0.567,0.702) (0.486,0.621,0.864,1)
FM4 (0.723,0.829,0.936,1) (0.486,0.621,0.864,1) (0.513,0.648,0.837,0.973)
FM5 (0.574,0.68,0.808,0.914) (0.405,0.54,0.81,0.945) (0.486,0.621,0.864,1)
FM6 (0.51,0.617,0.766,0.872) (0.486,0.621,0.864,1) (0.351,0.486,0.729,0.864)
FM7 (0.34,0.446,0.617,0.723) (0.081,0.162,0.297,0.432) (0.351,0.486,0.729,0.864)
$A^{+} = [(1,1,1,1),\, (1,1,1,1),\, (1,1,1,1)]$, $\quad A^{-} = [(0,0,0,0),\, (0,0,0,0),\, (0,0,0,0)]$
Then, the distances di+ and di− are calculated according to Eqs. (16)–(18), as shown in Table 11. The closeness coefficients calculated by the fuzzy TOPSIS are the basis for ranking the FMs. The closeness coefficient of each FM based on Eq. (19) and the rank of all FMs are also given in Table 11.
The approach based on the comprehensive weighted fuzzy TOPSIS is used to rank the FMs. In Table 11, the ranking order of the seven common FMs of the door system is FM4 > FM5 > FM6 > FM3 > FM7 > FM2 > FM1. FM4 (EDCU function failure) is the most important and has the highest priority, followed by FM5 (off-travel switch S1 damaged), FM6 (nut components damage), FM3 (long guide pillar bad lubrication), FM7 (nut components loosening), FM2 (rolling wheel wear), and FM1 (roller loose).
Table 11. The closeness coefficient and ranking order of each failure mode
No. di+ di− Ci Ranking
FM1 2.588 0.445 0.146 7
FM2 2.578 0.455 0.150 6
FM3 2.549 0.482 0.159 4
FM4 2.205 0.817 0.270 1
FM5 2.287 0.741 0.244 2
FM6 2.312 0.716 0.236 3
FM7 2.563 0.469 0.154 5
literature [14]. These three FMs should be regarded as the key factors affecting the normal operation of the door system, and design and improvement work should focus on them. Similarly, in the ranking obtained by the comprehensive weighted fuzzy TOPSIS, the relative closeness degrees of FM4, FM5, and FM6 are significantly higher than those of the other 4 FMs, so they have high priority in design improvement and maintenance. The ranking of these three FMs by the former method is FM4 > FM5/FM6, while by the latter it is FM4 > FM5 > FM6. FM5 and FM6 obtain the same grey correlation degree in the former method, but the fuzzy assessments in Table 4 clearly show that the values of the two FMs are different, so a fully coincident result indicates that the former method has some disadvantages. The small differences in the results obtained by the latter demonstrate the feasibility and applicability of the comprehensive weighted fuzzy TOPSIS method. In both methods, the lowest priority is FM1 (roller loose), which means that the importance of roller loose is the lowest in the door system.
In addition, the former method does not take the weights of the experts into account, whereas each expert's weight is considered in the latter method. Because the experts do not specialize in all fields, the experts' weights are considered in this research in order to obtain results that are more consistent with the actual situation. This is also an innovation of the comprehensive weighted fuzzy TOPSIS method.
4 Conclusions
Although the application of fuzzy TOPSIS in the field of FMEA is still relatively rare, the results of this research prove the feasibility and validity of the method. This paper mainly studies the multi-factor comprehensive weighted fuzzy TOPSIS method and its application. Specifically, on the basis of the original method, subjective and objective factors are considered together, the differences between experts are reflected through given relative weights, and the results obtained are therefore more accurate and closer to the actual situation. The advantages of the new FMEA are as follows:
(1) The values of the three risk factors are not multiplied directly, which eliminates the problems existing in simple multiplication.
References
1. Hong, Z., Yang, L.: Application of metro project risk evaluation based on FMEA combined
with FAHP and frequency analysis. J. Eng. Manag. 29(1), 53–58 (2015)
2. Shuzhong, Z., Qinda, Z.: Components importance degree evaluation of large crane based
FMEA and variable weight AHP. J. Chongqing Univ. Technol. Nat. Sci. 28(5), 34–38
(2014)
3. Braglia, M.: MAFMA: multi-attribute failure mode analysis. Int. J. Qual. Reliab. Manag. 17
(9), 1017–1033 (2000)
4. Taya, K.M., Lim, C.P.: Enhancing the failure mode and effect analysis methodology with
fuzzy inference technique. J. Intell. Fuzzy Syst. 21(1–2), 135–146 (2010)
5. Sachdeva, A., Kumar, D., Kumar, P.: Maintenance criticality analysis using TOPSIS. In:
Proceedings of the IEEE International Conference on Industrial Engineering and
Engineering Management (IEEM 2009), Hong Kong, pp. 199–203. IEEE Press,
Piscataway (8–11 December 2009)
6. Sachdeva, A., Kumar, D., Kumar, P.: Multi-factor failure mode criticality analysis using
TOPSIS. J. Ind. Eng. Int. 5(8), 1–9 (2009)
7. Wang, C.H.: A novel approach to conduct risk analysis of FMEA for PCB fabrication
process. In: Proceedings of the IEEE International Conference on Industrial Engineering and
Engineering Management (IEEM 2011), Singapore, pp. 1275–1278. IEEE Press, Piscataway
(6–9 December 2011)
8. Xionglin, Z., Caiyun, Z.: Service performance evaluation of TPL suppliers based on
triangular fuzzy TOPSIS. Techn. Methods 34(2), 176–179 (2015)
9. Can, L., Fengrong, Z., et al.: Evaluation and correlation analysis of land use performance
based on entropy-weight TOPSIS method. Trans. Chin. Soc. Agric. Eng. (Trans. CSAE) 29
(5), 217–227 (2013)
10. Chen, S.J., Hwang, C.L.: Fuzzy Multi Attribute Decision Making, vol. 375. Springer, New
York (1992)
11. Chen, C.T.: Extension of the TOPSIS for group decision-making under fuzzy environment.
Fuzzy Sets Syst. 114(1), 1–9 (2000)
12. Chen, C.B., Klien, C.M.: A simple approach to ranking a group of aggregated fuzzy utilities.
IEEE Trans. Syst. Man Cybern. Part B 27(1), 26–35 (1997)
13. Deng, H., Yeh, C., Willis, R.J.: Intercompany comparison using modified TOPSIS with
objective weights. Comput. Oper. Res. 27, 963–973 (2000)
14. Jun, X., Xiang, G.: Failure mode criticality analysis of metro door system. Modular Mach.
Tool Autom. Manuf. Tech. 3, 49–52 (2014)
Efficient Query Algorithm
of Coallocation-Parallel-Hash-Join
in the Cloud Data Center
Abstract. In the hybrid architecture of a cloud data center, data partitioning is an important factor that affects query performance. For costly join operations executed in the hybrid MapReduce style, large-scale transmission of data across nodes causes a huge network and I/O overhead. In order to reduce the data traffic and improve the efficiency of join queries, this paper proposes an efficient algorithm called Coallocation Parallel Hash Join (CPHJ). First, CPHJ designs a consistent multi-redundant hashing algorithm that distributes tables with join relationships across the cluster according to their join attributes, which improves data locality in join query processing while ensuring the availability of the data. Then, on the basis of the consistent multi-redundant hashing algorithm, a parallel join query algorithm called ParallelHashJoin is proposed that effectively improves the efficiency of join queries. The CPHJ method is applied in the data warehouse system of Alibaba, and experimental results indicate that its efficiency on such queries is nearly five times that of the Hive system.
1 Introduction
Along with the fast development of Internet applications, researchers face a tough challenge in storing and processing massive data. Traditional database technology cannot meet the needs of massive data management because of its weak scalability and high cost. In recent years, Google has put forward many mainstream technologies for storing and analyzing massive data, such as the distributed file system GFS [1] and the parallel programming framework MapReduce [2–4].
Based on the design philosophy of GFS and MapReduce, the Hadoop project under the open-source community Apache implemented Hadoop-DFS and Hadoop-MapReduce; the former is a distributed file system and the latter can be understood as a parallel programming framework. Currently, Hadoop is widely used at Yahoo, Facebook, and other Internet companies to store and analyze massive data. Because the MapReduce programming model is at a low level, developers need to write different MapReduce applications for different data analysis tasks, which makes programs difficult to maintain and reuse. In order to facilitate the development of upper-layer applications, Hive [5–7], Pig [8], and other techniques have been proposed to encapsulate the MapReduce programming framework, providing an SQL-like call interface for upper-layer applications and making development easier.
In statistical analysis queries, JOIN is one of the main operations. Hive uses a sort-merge algorithm when dealing with JOIN operations (SortMergeReduceJoin, hereinafter referred to as ReduceJoin). The implementation of the algorithm is divided into the Map and Reduce stages: in the Map phase, the join attributes of the two tables to be joined are sorted; in the Reduce stage, the sorted results generated in the Map stages are merged and joined, and the query results are output. There are, however, two problems with this algorithm: (1) a large number of intermediate results generated in the Map stage need to be transmitted to the Reduce side through the network, which consumes a lot of bandwidth and harms the efficiency of the algorithm; (2) multiple merge and sort operations are required, which lead to greater cost and longer execution time.
In response to these problems, we propose a parallel join query processing algorithm based on consistent multi-redundant hashing in the MapReduce environment (Co-location Parallel HashJoin, hereinafter referred to as CPHJ). The basic idea of CPHJ is to hash the values of the join attributes of two tables that are frequently joined so that matching records are co-located; when dealing with join queries, CPHJ can then obtain the results by performing the HashJoin algorithm only in the Map stage. This reduces the time overhead of query processing because no Reduce stage has to be executed. The proposed algorithm has been applied in the Alibaba data warehouse (Alibaba distributed data warehouse, referred to as ADW). The results show that when CPHJ is applied to join queries on the partition attributes, the execution time is only about 20 % of that of the ReduceJoin algorithm used in Hive. The ideas presented in this paper also apply to Group-by queries.
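To make the idea concrete, the following minimal sketch shows a map-side hash join of two co-located partitions, which is the operation each map task performs in ParallelHashJoin; the data layout and function names are illustrative assumptions, not the actual ADW implementation.

```python
from collections import defaultdict

def map_side_hash_join(partition_r, partition_s):
    """Join two co-located partitions on their join key entirely inside one map task.

    partition_r, partition_s: lists of (join_key, payload) tuples that were placed
    on the same node by consistent multi-redundant hashing of the join attribute.
    """
    hash_table = defaultdict(list)
    for key, payload in partition_r:          # build phase on one input
        hash_table[key].append(payload)
    results = []
    for key, payload in partition_s:          # probe phase on the other input
        for r_payload in hash_table.get(key, []):
            results.append((key, r_payload, payload))
    return results
```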
The second section describes related work; the third section presents the main idea of the CPHJ algorithm together with an analysis of its performance; Sect. 4 gives the experimental results and analysis; the last section is the conclusion and outlook.
2 Related Work
In the storage and processing of massive data, driven by a large number of research institutes and companies, cutting-edge technology develops very quickly. Google, Amazon, IBM, Microsoft, and other companies have invested heavily in research in this area and have proposed a variety of innovative technologies for managing massive data. These studies focus on three layers: the storage and computing of massive data and the user interface.
The storage layer provides reliable storage and efficient file access services for massive data. The main techniques include Google's distributed file system GFS,
Hadoop’s HDFS, KFS [9, 10], Amazon’s S3 [11, 12] and so on. Computing layer
provides parallel computing services for massive data and use API that provided by the
storage layer to read the data. In the process of implementing, the computing layer will
return the real-time status of the job to the interface layer and feedback the final
execution results until the end of the job. Parallel computing technology towards
processing massive data includes Google’s MapReduce, Hadoop and Microsoft’s
Dryad [13], etc. The function of interface layer is to provide programming API,
interactive shell and web GUI interface. It also makes syntax analysis, semantic
analysis and query optimization to the SQL sentence that users submit to the system
(Or other query languages).Finally, it is converted to one or more jobs to perform by
the computing layer. Currently, the main research work on interface layer includes
Hive, Pig, Scope [14], DryadLINQ [15], Sawzall [16] and so on.
Hive is a representative work on the interface layer that mainly used in Facebook’s
analysis work of massive data. It provide call interface of SQL similar sentence to
upper applications with the Hadoop DFS as the storage engine and Hadoop MapRe-
duce as the computational engine. When processing join queries, Hive mainly adopts
the algorithm of ReduceJoin.
During the execution of ReduceJoin, the Reduce stage only performs after the
completion of all the Map phases. It will bring a lot of network traffic and makes less
efficient that Reduce tasks need to pull the intermediate results produced by the map
side. Meanwhile, Reduce task should perform multiple operations of Sort/Merge which
requires a lot of calculations and results in lower efficiency of the algorithm.
files of different columns from the same table are mapped to different nodes, relevant data can achieve aggregation at the physical level so as to enhance the performance of queries.
without using the consistent multi-redundant hashing algorithm. Assume the number of nodes in the cluster is M, $B_j$ is the total number of data blocks used by task j, and $b_i \; (0 \le i < M)$ is the number of those data blocks stored on node i, so that $B_j = \sum_{i=0}^{M-1} b_i$. Let C represent the number of hash partitions and D the total quantity of data, so the average amount of data in each data block is $D/B_j$ and the average number of data blocks processed in one partition of task j is $B_j/C$. In the map stage, a partition works as the processing unit, and the scheduling algorithm tries to ensure the locality of the data block that launches each map task, so the remaining number of data blocks in every map task is $B_j/C - 1$, and the probability that the data of one map task are local is $(1/M)^{B_j/C - 1}$. Suppose there are R copies of each block; this probability becomes $(R/M)^{B_j/C - 1}$, so the probability that the total data are local is $(R/M)^{(B_j/C - 1)\,C}$. During the execution of the ParallelHashJoin algorithm, since the task scheduler tries to ensure the locality of a data block, the amount of data that needs to be pulled by every map task is:

$\left(\dfrac{B_j}{C} - 1\right)\left(1 - \dfrac{R}{M}\right)\dfrac{D}{B_j} = \left(\dfrac{B_j}{C} - 1\right)\left(1 - \dfrac{R}{M}\right)\dfrac{D}{\sum_{i=0}^{M-1} b_i}$
When the number of nodes in the cluster reaches a certain size, such as 50 machines, the amounts of data that need to be pulled are 2.79 TB, 2.736 TB, and 2.679 TB when the number of copies is 1, 2, and 3, respectively, accounting for 93 %, 91.2 %, and 89.3 % of the total amount of computation. When the cluster reaches a larger scale, such as 250 nodes, the amounts of data to be pulled are 2.839 TB, 2.8272 TB, and 2.816 TB, which account for 94.6 %, 94.3 %, and 93.7 % of the total amount of computation, respectively.
Thus, without coallocation, only a very small amount of data achieves locality. Data locality is very poor, and a large amount of data has to be transferred across the network.
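As an illustration of the formula above, the following small sketch computes the amount of data pulled across the network per map task and in total; the parameter values are arbitrary examples rather than the configuration used in the experiments.

```python
def pulled_data(D, Bj, C, M, R):
    """Data pulled across the network when coallocation is not used.

    D: total amount of data, Bj: number of data blocks of task j,
    C: number of hash partitions, M: number of nodes, R: number of replicas.
    """
    per_map = (Bj / C - 1) * (1 - R / M) * (D / Bj)   # remote blocks * block size
    return per_map, per_map * C                        # per map task, all C map tasks

# e.g. 3 TB of data, 3000 blocks, 100 partitions, 50 nodes, 3 replicas
print(pulled_data(D=3.0, Bj=3000, C=100, M=50, R=3))
```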
Since the storage layer stores files by dividing them into blocks, we adjust the size of the data blocks while keeping the amount of data to be processed unchanged. Supposing the average numbers of data blocks to be processed are 10, 20, and 30, respectively, the number of copies is 3 in all cases, and the remaining parameters are unchanged, Fig. 5 shows the results obtained.
As can be seen from Fig. 5, the size of the data blocks also has a certain effect on the amount of data to be pulled when the total data to be processed is unchanged. Configuring larger data blocks can reduce the amount of data pulled across the network to a certain extent, which improves data locality. On the whole, however, the amount of data that must be pulled is still large.
In summary, when the consistent multi-redundant hashing algorithm is not used to distribute the data, data locality is comparatively poor when performing ParallelHashJoin. Even for a small cluster, only a very small fraction of the data achieves locality, and the network load is very heavy because most of the data are transmitted through the network. If the consistent multi-redundant hashing algorithm is adopted to distribute the data, the relevant data are gathered on the same physical nodes; by scheduling the tasks to the physical nodes that store the aggregated data, data locality can reach 100 %. The network load in the cluster can then be significantly reduced and the performance of join queries can be improved greatly.
The results illustrated in Fig. 6 show that the performance of ParallelHashJoin is far better than that of the ReduceJoin used in Hive; in short, the execution time of the former is only 49.8 % of that of the latter. Most of the improvement comes from two factors: (1) the start-up cost of Reduce tasks is large; (2) Map and Reduce tasks need to pull data across the network, which involves not only network overhead but also frequent disk I/O.
CPHJ, which additionally considers data locality, improves the performance further compared with ReduceJoin and ParallelHashJoin: its execution time accounts for only 26 % of that of the ReduceJoin used in the Hive system, a speedup of 3.76.
5 Conclusion
This paper presents an efficient parallel join query algorithm called CPHJ. CPHJ designs a consistent multi-redundant hashing algorithm that distributes tables with join relationships across the cluster according to their join attributes, which improves data locality in join query processing while ensuring the availability of the data. On the basis of the consistent multi-redundant hashing algorithm, a parallel join query algorithm called ParallelHashJoin is proposed that effectively improves the processing efficiency of join queries. The CPHJ method is applied in the data warehouse system of Alibaba, and experimental results indicate that its efficiency on such queries is nearly five times that of the Hive system.
China (863 Program) (No. 2007AA01Z404), Research Fund for the Doctoral Program of High
Education of China (No. 20103218110017), a project funded by the Priority Academic Program
Development of Jiangsu Higher Education Institutions (PAPD), the Fundamental Research
Funds for the Central Universities, NUAA (No. NP2013307), Funding of Jiangsu Innovation
Program for Graduate Education KYLX_0287, the Fundamental Research Funds for the Central
Universities.
References
1. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: Proceedings of the
SOSP 2003, pp. 20–43 (2003)
2. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun.
ACM 51(1), 107–113 (2003)
3. Yang, H., Dasdan, A., Hsiao, R.L., et al.: Map-reduce-merge: simplified relational data
processing on large clusters. In: Proceedings of the 2007 ACM SIGMOD International
Conference on Management of Data. ACM, pp. 1029–1040 (2007)
4. Lämmel, R.: Google’s MapReduce programming model –Revisited. Sci. Comput. Program.
70(1), 1–30 (2008)
5. Apache Hive. http://hadoop.apache.org/hive/
6. Thusoo, A., Sarma, J.S., Jain, N., et al.: Hive: a warehousing solution over a map-reduce
framework. Proc. VLDB Endowment 2(2), 1626–1629 (2009)
7. Thusoo, A., Sarma, J.S., Jain, N., et al.: Hive-a petabyte scale data warehouse using
hadoop. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE),
pp. 996–1005. IEEE (2010)
8. Olston, C., Reed, B., Srivastava, U., et al.: Pig latin: a not-so-foreign language for data
processing. In: Proceedings of the 2008 ACM SIGMOD International Conference on
Management of Data, pp. 1099–1110. ACM (2008)
9. White, T.: Hadoop: the Definitive Guide. O’Reilly, Sebastopol, CA (2012)
10. Apache Hadoop. http://hadoop.apache.org
11. Murty, J.: Programming Amazon Web Services: S3, EC2, SQS, FPS, and SimpleDB.
O’Reilly Media Inc., Sebastopol, CA (2009)
12. Patten, S.: The S3 Sookbook: Get Cooking with Amazon’s Simple Storage Service. Sopobo
(2009)
13. Isard, M., Budiu, M., Yu, Y., et al.: Dryad: distributed data-parallel programs from
sequential building blocks. ACM SIGOPS Operating Syst. Rev. 41(3), 59–72 (2007)
14. Chaiken, R., Jenkins, B., Larson, P.Å., et al.: SCOPE: easy and efficient parallel processing
of massive data sets. Proc. VLDB Endowment 1(2), 1265–1276 (2008)
15. Yu, Y., Isard, M., Fetterly, D., et al.: DryadLINQ: a system for general-purpose distributed
data-parallel computing using a high-level language. In: OSDI, vol. 8, pp. 1–14 (2008)
16. Pike, R., Dorward, S., Griesemer, R., et al.: Interpreting the data: parallel analysis with
Sawzall. Sci. Programm. 13(4), 277–298 (2005)
17. DeWitt, D.J., Gerber, R.H., Graefe, G., et al.: A High Performance Dataflow Database
Machine. Computer Science Department, University of Wisconsin (1986)
18. Li, J., Srivastava, J., Rotem, D.: CMD: a multidimensional declustering method for parallel
database systems. In: Proceedings of the 18th VLDB Conference, pp. 3–14 (1992)
19. Chen, T., Xiao, N., Liu, F., et al.: Clustering-based and consistent hashing-aware data
placement algorithm. J. Softw. 21(12), 3175–3185 (2010)
20. Karger, D., Lehman, E., Leighton, T., et al.: Consistent hashing and random trees:
distributed caching protocols for relieving hot spots on the World Wide Web. In:
Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing.
ACM, pp. 654–663(1997)
21. DeCandia, G., Hastorun, D., Jampani, M., et al.: Dynamo: Amazon’s highly available
key-value store. In: SOSP, vol. 7, pp. 205–220 (2007)
Dynamic Data Driven Particle Filter
for Agent-Based Traffic State Estimation
1 Introduction
Traffic state estimation has become a research hotspot due to increasingly serious traffic congestion problems. It is difficult to evaluate traffic flow using only static data because of its nonlinear, non-Gaussian, and high-dimensional random features [1]. The rapid development of static and dynamic sensors on real road networks allows us to obtain real-time data that reflect the critical properties of traffic flow, such as the vehicles' average speed and density. The rational use of dynamic sensor data has become the key to improving the accuracy of traffic state estimation [2]. Bi Chang and Fan in [3] proposed a particle filter based approach to estimate freeway traffic state using a macroscopic traffic flow model; the traffic flow speed and density collected by measurement detectors were assimilated into the model, and experimental results showed that the particle filter achieved encouraging estimation performance.
The particle filter (PF) is a data assimilation method used to solve the recursive state estimation problem. Unlike the Kalman filter and its optimized variants, PF does not depend on assumptions of linearity or Gaussian noise and can be applied to various systems, even those with non-linear dynamics and non-Gaussian noise [7–9]. In the PF algorithm, weighted particles are used to describe the posterior distribution of the system state, but particle degradation usually happens after several iterations. Normally, the degradation can be overcome by particle resampling; nevertheless, resampling may result in particle enervation, which means that particles with bigger weights are selected time after time, so the particles lose diversity [6]. How to prevent particle degradation and keep particle diversity thus becomes the key to applying the particle filter.
Agent-based modeling (ABM) is a relatively new approach to modeling systems composed of autonomous, interacting agents, and a way to model the dynamics of complex adaptive systems [4]. In [5], an agent-based microscopic pedestrian flow simulation model was presented; the comparison between simulation results and empirical studies reveals that the model can reproduce the density-speed fundamental diagrams and the empirical flow rates at bottlenecks within acceptable system dimensions. Considering the similarities between pedestrian flow and vehicle flow, it is reasonable to model traffic flow using the agent-based method.
In this paper, we apply the particle filter to an agent-based traffic simulation model, which has attracted little attention so far but is just as important as the macroscopic ones. The particle degradation and enervation problems are solved based on the idea of the dynamic data driven application system (DDDAS). The traffic state on all roads can then be estimated from the simulation states.
The rest of this paper is organized as follows. General information about the particle filter is introduced in Sect. 2. Section 3 presents the agent-based traffic simulation model. Section 4 describes the dynamic data driven particle filter in detail, and the experiments and results are presented in Sect. 5. Conclusions are drawn in the last section.
2 Related Work
$s_{t+1} = f(s_t, t) + \gamma_t$    (1)

$m_t = g(s_t, t) + \omega_t$    (2)
The core algorithm of the particle filter is sequential importance sampling (SIS). SIS runs over many iterations. In each iteration, the observation $m_t$ and the previous state of the system $S_{t-1}$ are received; a sample set $s_{t-1}^{(i)} \in S_{t-1}$ is drawn from the proposal density $q(s_t^{(i)} \mid s_{t-1}^{(i)}, m_t)$ to predict the next state, and then the
weight of each particle is updated as $w_n = \{w_n^{i},\; i = 1, 2, \ldots, N\}$. The SIS method suffers from a serious drawback: after a few iteration steps, most of the normalized weights become very close to zero, and the degeneracy phenomenon is unavoidable. Liu et al. [10] introduced an approximate measure of particle degradation, as shown in Eq. (3).
$\hat{N}_{\mathrm{eff}} = \left[\sum_{i=1}^{N} (w_n^{i})^2\right]^{-1}$    (3)
In order to solve the degradation problem, a resampling step is added after weight normalization. In the resampling step, samples drawn with probability proportional to the normalized sample weights are used to update the system state.
To sum up, the basic particle filter can be summarized as follows.
Step 1. Particle generation: sample N particles from the predictive distribution.
Step 2. Weight computation: assume the states of the N particles at time step t are $s_t^{i}, i = 1, \ldots, N$; compute the importance weights $w_t^{i}, i = 1, \ldots, N$ and normalize them according to $\hat{w}_t^{i} \propto w_t^{i} \left[\sum_{i=1}^{N} w_t^{i}\right]^{-1}$.
Step 3. Resampling: based on the resampling algorithm, discard the low-weight particles in the particle set $\{s_t^{i}, w_t^{i}\}_{i=1}^{N}$ and continue the estimation with the new particle set $\{s_t^{i}, w_t^{i}\}_{i=1}^{N}{}'$ for one time step. Set t = t + 1 and return to Step 1.
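A minimal sketch of the basic particle filter described in Steps 1-3 is given below; the transition and likelihood functions are placeholders that a concrete traffic model would supply, and the resampling threshold is an illustrative assumption.

```python
import numpy as np

def particle_filter_step(particles, weights, observation, transition, likelihood):
    """One SIS iteration with multinomial resampling (Steps 1-3).

    transition(particles) -> propagated particles; likelihood(obs, particles) -> weights.
    """
    particles = transition(particles)                      # Step 1: predict
    weights = weights * likelihood(observation, particles)
    weights = weights / weights.sum()                       # Step 2: normalize
    n_eff = 1.0 / np.sum(weights ** 2)                      # Eq. (3)
    if n_eff < 0.5 * len(weights):                          # Step 3: resample if degenerate
        idx = np.random.choice(len(weights), size=len(weights), p=weights)
        particles = particles[idx]
        weights = np.full(len(weights), 1.0 / len(weights))
    return particles, weights
```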
$s^{*}(v, \Delta v) = s_0 + \max\!\left(0,\; vT + \dfrac{v\,\Delta v}{2\sqrt{ab}}\right)$    (5)
$SRN(t-1) = \{sData_{t-1}[1], sData_{t-1}[2], \ldots, sData_{t-1}[k]\} = MM(RN(t-1))$    (6)
The dynamic state-space model is defined in Eq. (7). RN(t − 1) and RN(t) are the states of the traffic flow at times t − 1 and t, respectively; EnMovSim is the improved system model based on MovSim; PF is the assimilation model; RRN(t − 1) is the sensor data that represent the states of the real roads; and γ(t) is the state noise.
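Equation (5) has the form of the desired-gap function of the intelligent driver model [11]; a minimal sketch under common illustrative parameter values (not necessarily those of the paper's MovSim configuration) is shown below.

```python
import math

def desired_gap(v, dv, s0=2.0, T=1.5, a=1.4, b=2.0):
    """Desired minimum gap s*(v, dv), Eq. (5).

    v: own speed (m/s), dv: approaching rate to the leader (m/s),
    s0: minimum gap (m), T: desired time headway (s),
    a: maximum acceleration (m/s^2), b: comfortable deceleration (m/s^2).
    """
    return s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a * b)))

print(desired_gap(v=15.0, dv=2.0))
```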
plenty of simulation particles according to the basic information about the road network and vehicles; the deviation between the simulation results and the real scene can be narrowed by assimilating real-time data from the "Sensor data management" component. The data assimilation method can optimize its parameters and execution strategy based on the simulation results, and the type and frequency of the sensor data from "Sensor data management" can be adjusted similarly. Once the particle filter reaches a certain precision, the traffic state of interest can be estimated and the simulated agent-based traffic scene can be displayed.
accelerate the simulation process. When the simulation results reach a certain precision, the assimilated data are changed to comprehensive ones. Secondly, the range of the random values has a negative relationship with the average weight of the current particle set. In other words, at the beginning of the simulation there may be a big difference between the simulation and the real scene, and the simulation can be sped up by adding relatively large random variables; as the simulation approaches the real state, the random variables should be decreased to ensure the stability of the simulation results. Lastly, if the particle set satisfies the diversity requirement, which can be judged by $\hat{N}_{\mathrm{eff}}$, the resampling step is skipped; otherwise, resampling is executed based on the normalized weights recalculated with $\hat{N}_{\mathrm{eff}}$. The detailed steps are described in the next section.
At the beginning of the simulation, set b = 1; when the PF reaches a certain precision, set b as in Eq. (10). $senS_{[1-k]}(t)$ is the real average speed and $simS_{[1-k]}(t)$ is the speed obtained from the measurement model. If the value of a, b, or sum is 0, we set it to 0.1. The weights are normalized as in Eq. (12).
$sum = \sum_{i=1}^{K} \left| simD_{[i]}(i, t) - senD_{[i]}(t) \right|$    (11)

$\tilde{w}(i, t) = \dfrac{w(i, t)}{\sum_{j=1}^{N} w(j, t)}$    (12)
Step 4. Calculate the effective number of particles using Eq. (13) and the average weight using Eq. (14). If $\hat{N}_{\mathrm{eff}} > 0.4N$, go to Step 6; otherwise continue to Step 5.
$\hat{N}_{\mathrm{eff}} = \left[\sum_{i=1}^{N} \left(\tilde{w}(i, t)\right)^2\right]^{-1}$    (13)

$\bar{w}(t) = \dfrac{1}{N}\sum_{i=1}^{N} w(i, t)$    (14)
Step 5. Resample: to solve the particle enervation problem, the normalized weights are recalculated as in Eq. (15). It has been proved in [13] that if $0 < \hat{N}_{\mathrm{eff}}^{-c} < 1$, then $\tilde{\tilde{w}}(i, t) > \tilde{w}(i, t)$ for particles with small weights, which means the small weights are increased and the enervation problem is relieved. Equation (16) shows that $0 < \hat{N}_{\mathrm{eff}}^{-c} < 1$ indeed holds, and from the equation we can conclude that the more diverse the particles are, the fewer particles will be copied, which keeps the particles' diversity. The replication sequence, which stores the number of times each particle is to be copied, is calculated by Eq. (17). The states of the particles are then updated based on the replication sequence.
$\tilde{\tilde{w}}(i, t) = \dfrac{\left[\tilde{w}(i, t)\right]^{\hat{N}_{\mathrm{eff}}^{-c}}}{\sum_{j=1}^{N} \left[\tilde{w}(j, t)\right]^{\hat{N}_{\mathrm{eff}}^{-c}}}, \quad (0 < c < 1)$    (15)

$\hat{N}_{\mathrm{eff}}^{-c} = \left[\sum_{i=1}^{N} \left(\tilde{w}(i, t)\right)^2\right]^{c}, \quad \sum_{i=1}^{N} \tilde{w}(i, t) = 1, \; 0 < c < 1 \;\Rightarrow\; 0 < \hat{N}_{\mathrm{eff}}^{-c} < 1$    (16)
$\sum_{j=1}^{m-1} \tilde{\tilde{w}}(j, t) < u_i \le \sum_{j=1}^{m} \tilde{\tilde{w}}(j, t), \quad u_i \in \left\{\tfrac{1}{n}, \tfrac{2}{n}, \tfrac{3}{n}, \ldots, \tfrac{n-1}{n}, \tfrac{n}{n}\right\}$    (17)
Step 6. Proper random factors are added to the vehicles' parameters as shown in Eq. (18), where rand(1, −1) returns −1 or 1 randomly and $N(\mu, \sigma^2)$ is a normal distribution with mean $\mu$ and variance $\sigma^2$.

$P'(t, i) = P(t, i) + f(\bar{w}(t)) \cdot \mathrm{rand}(1, -1) \cdot N(\mu, \sigma^2)$    (18)
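A minimal sketch of the dynamic resampling of Steps 4-6 (Eqs. 13, 15, 17, and 18) follows; the perturbation-scale function f_scale, the threshold 0.4N, and the remaining parameter values mirror the description above, but the concrete details are illustrative assumptions.

```python
import numpy as np

def dpf_resample_step(particles, w, f_scale, c=0.5, mu=0.0, sigma=1.0):
    """Dynamic resampling of the DPF: temper the weights (Eq. 15), build the
    replication sequence (Eq. 17) and perturb the copied particles (Eq. 18).

    f_scale: user-supplied decreasing function f of the average weight, Eq. (18).
    """
    N = len(w)
    w_norm = w / w.sum()                                   # Eq. (12)
    n_eff = 1.0 / np.sum(w_norm ** 2)                      # Eq. (13)
    if n_eff > 0.4 * N:                                    # particles diverse enough: skip
        return particles
    tempered = w_norm ** (n_eff ** (-c))                   # Eq. (15): small weights are raised
    tempered /= tempered.sum()
    u = np.arange(1, N + 1) / N                            # Eq. (17): u_i = i/n
    idx = np.minimum(np.searchsorted(np.cumsum(tempered), u), N - 1)
    new_particles = particles[idx]                         # replication sequence
    noise = (np.random.choice([-1.0, 1.0], size=new_particles.shape)
             * np.random.normal(mu, sigma, size=new_particles.shape)
             * f_scale(w.mean()))                          # Eq. (18)
    return new_particles + noise
```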
Fig. 3. Road network from the Ming Palace to Zhongshan Gate in Nanjing
Initialize a single-thread simulation based on the road network and vehicle information, and set a breakdown point at a random position on the marked road. Congestion will occur due to the breakdown of the lane. Run the simulation for 300 s; all the vehicles' parameters at each time step are recorded to act as the "real" sensor data. Then remove the breakdown point and restart the simulation with 100 simulation threads. In this paper, SIS, UPF [14], and DPF are used to assimilate the "real" data, respectively, and the related results are shown below.
The $\hat{N}_{\mathrm{eff}}$ values of the algorithms are shown in Fig. 4. It can be concluded that SIS suffers from the particle degradation problem and that the particles in PF lose diversity after iterations of resampling, whereas UPF and DPF keep the particles' diversity within an appropriate range, and DPF also avoids many resampling operations. Figures 5 and 6 show the vehicles' density and average speed on the marked road at 180 s, respectively; the traffic simulation using DPF is the closest to the real scene. The performance of the different particle filter algorithms is displayed in Table 3. DPF obviously simulates the traffic flow accurately with faster speed, which indicates that the dynamic data driven methods used in weight computation, resampling, and the random model really do solve the particle degradation and enervation problems as well as reduce the execution time.
5 Conclusions
In this paper, a dynamic data driven PF for traffic state estimation is presented; the agent-based traffic model and the dynamic data driven application are described in detail. We solve the particle degradation and enervation problems by dynamically optimizing the execution strategy of the particle filter in weight computation, resampling, and the random model. A road network from the Ming Palace to Zhongshan Gate in Nanjing is used as the experimental object. The experiments indicate that the proposed particle filter can keep the diversity of the simulation particles and that the proposed framework can estimate the traffic state effectively with faster speed.
References
1. Zhang, G., Huang, D.: Short-term network traffic prediction with ACD and particle filter. In:
2013 5th International Conference on Intelligent Networking and Collaborative Systems
(INCoS), pp. 189–191 (2013)
2. Herrera, J.C., Bayen, A.M.: Traffic flow reconstruction using mobile sensors and loop
detector data. University of California Transportation Center Working Papers (2007)
3. Corporation, H.P.: Particle filter for estimating freeway traffic state in Beijing. Math. Probl.
Eng. 70, 717–718 (2013)
4. Macal, C.M., North, M.J.: Tutorial on agent-based modelling and simulation. J. Simul. 4
(112), 151–162 (2010)
5. Liu, S., Lo, S., Ma, J., Wang, W.: An agent-based microscopic pedestrian flow simulation
model for pedestrian traffic problems. IEEE Trans. Intell. Transp. Syst. 15, 992–1001 (2014)
6. Wang, Z., Liu, Z., Liu, W.Q., Kong, Y.: Particle filter algorithm based on adaptive
resampling strategy. In: 2011 International Conference on Electronic and Mechanical
Engineering and Information Technology (EMEIT), pp. 3138–3141 (2011)
7. Arulampalam, S., Maskel, S., Gordon, N., Clapp, T., Arulampalam, S., Maskel, S., Gordon,
N., Clapp, T.: A tutorial on PFs for on-line non-linear/non-gaussian bayesian tracking. Sci.
Program. 50, v2 (2002)
8. Chowdhury, S.R., Roy, D., Vasu, R.M.: Variance-reduced particle filters for structural
system identification problems. J. Eng. Mech. 139, 210–218 (2014)
9. Ruslan, F.A., Zain, Z.M., Adnan, R., Samad, A.M.: Flood water level prediction and
tracking using particle filter algorithm. In: 2012 IEEE 8th International Colloquium on
Signal Processing and its Applications (CSPA), pp. 431–435 (2012)
10. Liu, J.S.: Metropolized independent sampling with comparisons to rejection sampling and
importance sampling. Stat. Comput. 6, 113–119 (1996)
11. Kesting, A., Treiber, M., Helbing, D.: Enhanced intelligent driver model to access the impact
of driving strategies on traffic capacity. Philos. Trans. R. Soc. 368, 4585–4605 (2009)
12. Darema, F.: Dynamic data driven applications systems: a new paradigm for application
simulations and measurements. In: Bubak, M., van Albada, G.D., Dongarra, J., Sloot, P.M.A.
(eds.) ICCS 2008, Part III. LNCS, vol. 3038, pp. 662–669. Springer, Berlin, Heidelberg (2004)
13. Zhu, J., Wang, X., Fang, Q.: The improved particle filter algorithm based on weight
optimization. In: 2013 International Conference on Information Science and Cloud
Computing Companion (ISCC-C), pp. 351–356 (2013)
14. Merwe, R.V.D., Doucet, A., Freitas, N.D., Wan, E.A.: The unscented particle filter. In: Nips,
pp. 584–590 (2000)
An Improved Dynamic Spectrum Access
Scheme in Cognitive Networks
1 Introduction
The spectrum has become a scarce resource, and cognitive technology has been proposed as a prospective approach to overcome this problem by achieving dynamic spectrum access [1]. In a cognitive network, the owners of spectrum are called primary users (PUs), and the users that have to apply for spectrum from the PUs are called secondary users (SUs). The literature has introduced spectrum sharing between PUs and SUs [2–4].
Recently, solving the dynamic spectrum access problem with economic tools has become
increasingly popular [5]. A non-profit controller called the spectrum broker hosts the auction
activities [6]. Tao et al. discussed an online auction based on a relay selection scheme [7].
Ahmadi et al. brought a learning strategy into the auction mechanism to obtain satisfactory
results [8], but the algorithm's complexity is high and it only yields an improved, not optimal,
solution. Wei et al. considered different strategies selected by different SUs, which can bring
better revenue at the cost of longer delay [9]. Peng aimed to maximize the trading volume
through a pricing system [10]: when the PU leased more spectrum than the SUs needed, the
price was cut, and vice versa; this is overly complicated. Qinhui took location factors into
the auction activity and began to consider QoS [11]. Mohammad considered the realistic
situation in which many malicious users exist and designed a game between SUs during the
auction process [12], but the Nash equilibrium is difficult to reach. Dejun et al. also studied
the truthfulness of SUs' bidding information and clustered the SUs to give the mechanism
better scalability [13]. Xinyun divided time and frequency into smaller cells, let SUs bid for
a number of cells, and proposed a scheme named the Norm-based Greedy Scheme [14].
Joseph et al. modelled dynamic spectrum allocation as a knapsack problem, aiming to
maximize spectrum usage while lowering management overhead [15]. Based on this,
Changyan improved the idea in three ways: rearrangement, interchange, and replacement
[16]. However, none of those papers take the location factor into account, solve the problem
with genetic algorithms in polynomial time, or try to minimize interference during the
allocation phase.
In this paper, we model dynamic spectrum access as a knapsack problem. Each SU that bids
for spectrum from the PU has only two states: 1 denotes a winner of the spectrum and 0
denotes the opposite. When SUs bid for the spectrum, they also send private information
such as their location to the spectrum broker, and the broker then determines all the winners.
A payment rule is proposed to ensure that every SU can win the spectrum only by a truthful
bid. The next step is allocating all the free channels to the winners.
The main contributions of this paper are as follows:
• The dynamic spectrum access problem is modelled as a knapsack problem, and genetic
algorithms are selected to solve it in polynomial time. The genetic algorithms are modified
to make them more suitable for solving knapsack problems.
• A second-price scheme is adopted to determine the payment prices of the winners.
• The free channels are allocated to all the winners, trying to minimize the interference
according to the location information.
The rest of this paper is organized as follows. In Sect. 2, we propose the system model. In
Sect. 3, we illustrate the details of the spectrum access scheme and prove that the optimized
payment rule can ensure the truthfulness of the network. Section 4 presents the simulation
results and Sect. 5 concludes the paper.
2 System Model
For simplicity we consider a cognitive network with only one PU, multiple SUs, and a
central entity named the spectrum broker. The PU allows SUs to access idle channels for
extra profit. In an auction, the number of auctioned channels provided by the PU is denoted
as m; the bandwidths of the channels may vary, but for the sake of simplicity we assume
they are the same. We denote the SUs by N = {1, 2, 3, ..., n}, and the spectrum broker is
non-profit. The SUs have different demands for spectrum, and each calculates the value of
its needed spectrum itself, denoted by V = {v_1, v_2, ..., v_n}. Their bidding prices are
denoted by B = {b_1, b_2, ..., b_n}, and their demands by D = {d_1, d_2, ..., d_n}. We call
the network truthful
when v_i = b_i. For convenience, each SU either obtains its entire requirement or nothing.
At the start of the auction, the PU and the SUs submit their own information to the
spectrum broker. The spectrum broker collects all this information and determines an
optimal allocation scheme. Finally, the broker calculates the payments for the SUs and the
payoff for the PU.
In this section, the improved truthful spectrum access scheme with a novel spectrum
auction and allocation mechanism is discussed in detail. We first formulate the winner
determination problem (WDP) as a knapsack problem and use genetic algorithms to
solve it. We also modify the genetic algorithms to adapt them to the proposed spectrum
auction. After that, a pricing scheme similar to a second-price auction is designed to ensure
that all SUs can obtain a non-negative utility only by reporting their real information; a
simple proof is given later. Finally, we try to allocate the free channels to all the
winners with as little interference as possible.
$v_i = r_i d_i \log_2 (1 + g_i)$    (1)
Here r_i is the currency weight and g_i is the signal-to-noise ratio (SNR). Both of them are
constants for every su_i [17]. Before the auction, the PU reports its information to the
spectrum broker. The SUs also send their bids, denoted as Bid_i, i = 1, 2, ..., n. Each bid
Bid_i is specified as a 2-tuple (b_i, d_i), where
• d_i is the number of channels su_i demands: $d_i = \lceil k_i / w \rceil$, where k_i is the spectrum
demand of bidder su_i and w is the bandwidth of each channel.
• b_i indicates the payment su_i is willing to pay for its demand d_i.
After obtaining the complete information, the spectrum broker formulates an optimization
problem to determine the winners so as to maximize the social welfare, i.e., the total
payment from all winning SUs. The formulated optimization problem is

$\max_{\{x_i,\ \forall i \in N\}} \ \sum_{i=1}^{n} q_i x_i$    (2)
subject to

(1) $\sum_{i=1}^{n} d_i x_i \le m$
(2) $x_i \in \{0, 1\}$
where x_i = 1 means su_i wins the spectrum leased by the PU in the auction and x_i = 0
otherwise. The first constraint means that the total number of channels the SUs obtain
cannot exceed the quantity leased by the PU. Once the winner determination problem
has been solved, a novel payment function is applied. In simple terms, a winner's
payment is a function of the second-highest price, so q_i is computed as
$q_i = \sqrt{d_i} \, \max_{k \in B_i} \left( \frac{b_k}{\sqrt{d_k}} \right)$    (3)
We will introduce it in detail later in the payment rule part. $B_i$ is the subset of SUs whose
$b_k/\sqrt{d_k}$ values are no larger than that of $SU_i$. Nevertheless, the winner determination
problem is difficult to solve, so in the next subsections we introduce how to use genetic
algorithms to solve it.
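To make the payment rule concrete, the following is a minimal Python sketch of Eq. (3) under
the definitions above; the function name payment and the list-based inputs are illustrative
only, and B_i is taken to be the set of other SUs whose b_k/sqrt(d_k) does not exceed that of SU_i.

    import math

    def payment(i, bids, demands):
        # q_i = sqrt(d_i) * max_{k in B_i} (b_k / sqrt(d_k)), Eq. (3).
        # B_i: bidders other than i whose per-sqrt-channel bid does not exceed SU_i's.
        ratio_i = bids[i] / math.sqrt(demands[i])
        b_i = [k for k in range(len(bids))
               if k != i and bids[k] / math.sqrt(demands[k]) <= ratio_i]
        if not b_i:
            return 0.0  # B_i is empty; the charged price is left at zero in this sketch
        return math.sqrt(demands[i]) * max(bids[k] / math.sqrt(demands[k]) for k in b_i)

For example, with bids (8, 6, 3) and demands (4, 4, 1), the per-sqrt-channel bids are
(4.0, 3.0, 3.0), so the first SU pays sqrt(4) * 3.0 = 6.0, which is no more than its own bid of 8.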
It is challenging to solve the problem shown in formula (2) in polynomial time. In this paper,
the problem is modelled as a knapsack problem and genetic algorithms are selected to obtain
a near-optimal solution. It has been shown that genetic algorithms perform well on knapsack
problems, and we can obtain a solution no worse than that of the traditional greedy algorithm.
• If one chromosome is not a feasible solution, drop bidders starting from the one with the
minimum $b_i/\sqrt{d_i}$ value, i.e., search the states of the SUs from the tail of ordering (4).
• Do not allow identical chromosomes in one population, which prevents premature
convergence to the currently fittest chromosomes.
• Modify the fitness function to make it more suitable for the new payment rule.
The principal steps of the improved genetic algorithm are listed below:
(a) Start: Initialize the first generation of the population.
(i) Obtain one chromosome by the greedy algorithm: sort the bidders according to
ordering (4).

$\frac{b_1}{\sqrt{d_1}} \ge \frac{b_2}{\sqrt{d_2}} \ge \cdots \ge \frac{b_i}{\sqrt{d_i}} \ge \cdots \ge \frac{b_n}{\sqrt{d_n}}$    (4)

(ii) Randomly generate all the other chromosomes in the first generation.
(b) Inspection: Check whether every chromosome in the population is a feasible solution.
If the SUs' total demand exceeds the pack size, drop bidders from the tail of ordering (4).
(c) Fitness: Calculate the fitness value of each chromosome in the population. The fitness
function of the i-th chromosome in the population is

$\mathrm{fitness}(i) = \sum_{l=1}^{n} x_l \sqrt{d_l} \, \max_{k \in B_l} \left( \frac{b_k}{\sqrt{d_k}} \right)$    (5)
In the Inspection step, we first check whether all the chromosomes in the population are
feasible solutions. If the total demand of a chromosome is bigger than m, we search the
states of the SUs from the tail of ordering (4); if a state is 1, we change it to 0 until the
total demand no longer exceeds m.
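A sketch of this repair step, assuming a chromosome is simply a 0/1 list indexed by SU; the
helper name repair is ours.

    import math

    def repair(chromosome, bids, demands, m):
        # Restore feasibility: drop winners from the tail of ordering (4)
        # until the total channel demand fits into the m auctioned channels.
        order = sorted(range(len(bids)),
                       key=lambda i: bids[i] / math.sqrt(demands[i]),
                       reverse=True)
        total = sum(demands[i] for i in order if chromosome[i] == 1)
        for i in reversed(order):          # walk from the smallest b_i / sqrt(d_i)
            if total <= m:
                break
            if chromosome[i] == 1:
                chromosome[i] = 0
                total -= demands[i]
        return chromosome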
The next step is calculating the fitness. Instead of the traditional simple summation, we
create a new formula that fits the new payment rule better. We use the sum of the payoff
function, $\sum_{i \in \mathrm{winners}} q(i)$, as the chromosome's fitness value instead of the sum of the
bids, $\sum_{i \in \mathrm{winners}} v_i$; q(i) can be computed by Eq. (3). The fitness values of the chromosomes
are the basis of the next step, Selection.
In the Selection step, we traverse the current population and choose the two distinct
chromosomes with the highest fitness to generate the new solutions.
Rather than the simple crossover of the traditional genetic algorithm, we use several
strategies to generate a new population. First, an OR operation between the top two
chromosomes produces one new solution. Second, a crossover operation between the top
two produces another two chromosomes. Finally, chromosomes from the current population
are added to the new population in descending order of fitness. It is worth noting that
during this step identical chromosomes are not allowed in the new population. To facilitate
understanding, a sketch of this step is given below.
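The following is an illustrative Python sketch (not the authors' original listing) of the
population-generation strategy just described: OR of the two fittest chromosomes, a
one-point crossover between them, and a fitness-ranked fill with duplicates rejected. The
names next_population and fitness are ours; the resulting chromosomes would still be passed
through the Inspection/repair step before fitness evaluation.

    import random

    def next_population(population, fitness, size):
        # population: list of 0/1 lists; fitness(chrom) returns the value of Eq. (5).
        ranked = sorted(population, key=fitness, reverse=True)
        top1, top2 = ranked[0], ranked[1]
        new_pop = []

        def add(chrom):
            if chrom not in new_pop:       # identical chromosomes are not allowed
                new_pop.append(chrom)

        add([a | b for a, b in zip(top1, top2)])     # OR of the two fittest chromosomes
        cut = random.randrange(1, len(top1))         # one-point crossover of the two fittest
        add(top1[:cut] + top2[cut:])
        add(top2[:cut] + top1[cut:])
        for chrom in ranked:                         # fill up by descending fitness
            if len(new_pop) >= size:
                break
            add(chrom)
        return new_pop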
• If su_j wins the auction, its demand is d_j, and B_j is non-empty, the payment q_j can be
represented as

$q_j = \sqrt{d_j} \, \max_{k \in B_j} \left( \frac{b_k}{\sqrt{d_k}} \right)$    (6)
We claimed that the auction scheme guarantees the truthfulness of the network; we now give
a simple proof. That is, the spectrum auction scheme maximizes the social welfare while
satisfying two basic constraints: incentive compatibility and individual rationality.
The auction scheme is incentive compatible, which means each SU can obtain a
non-negative utility only by truthful bidding. We only need to consider the following two
cases.
(a) If su_j takes part in the auction with its truthful bid (v_j = b_j), it obtains utility
u_j >= 0: either (i) su_j loses the auction and its utility is u_j = 0, or (ii) su_j wins the
auction and obtains the utility

$u_j' = v_j - q_j = v_j - \sqrt{d_j} \, \max_{k \in B_j} \left( \frac{b_k}{\sqrt{d_k}} \right) \ge 0$    (7)
(b) If su_j takes part in the auction untruthfully (i.e., v_j != b_j), its utility is affected only
if its bidding price is larger than its real value. If su_j loses the auction, its utility is
u_j = 0; otherwise, the only condition under which su_j may win is
$b_j/\sqrt{d_j} \ge \max_{k \in B_j}(b_k/\sqrt{d_k}) \ge v_j/\sqrt{d_j}$. In this case, its utility can be
proved to be non-positive:

$u_j' = v_j - q_j' = v_j - \sqrt{d_j} \, \max_{k \in B_j} \left( \frac{b_k}{\sqrt{d_k}} \right) \le v_j - \sqrt{d_j} \cdot \frac{v_j}{\sqrt{d_j}} = 0$    (8)
In conclusion, su_j cannot increase its utility by an untruthful bid, no matter how much
larger the bid is.
The auction scheme is individually rational, that is, our scheme guarantees that each
truthful SU has a non-negative utility. If su_j loses the auction, its utility is zero. Otherwise,
su_j wins and its utility can be calculated as

$u_j = v_j - q_j = b_j - \sqrt{d_j} \, \max_{k \in B_j} \left( \frac{b_k}{\sqrt{d_k}} \right) = \left[ \frac{b_j}{\sqrt{d_j}} - \max_{k \in B_j} \left( \frac{b_k}{\sqrt{d_k}} \right) \right] \sqrt{d_j} \ge 0$    (9)

The above inequality holds when su_j participates in the auction truthfully and wins. Thus,
the payment rule achieves the maximum social welfare while satisfying the two basic
constraints: incentive compatibility and individual rationality.
4 Numerical Results
In this part, we do some simulations to evaluate the improved genetic algorithm in the
auction step. For comparison purposes, the greedy algorithm is simulated as a benchmark.
The considered cognitive radio network consists of 20 SUs and one PU. For the PU, the
number of auctioned channels it offers is randomly selected in [20, 50]. For each SU i,
∀i ∈ N, its spectrum requirement is chosen randomly in [5, 50], while its SNR g_i is random
in [50, 100]. All results are averaged over 1000 runs.
Figure 2 shows the social welfare obtained with different solutions to the WDP. It shows
that the improved genetic algorithm can improve on the initial population produced by the
greedy algorithm, which means the genetic algorithm is never worse than the greedy
algorithm. Moreover, the improved genetic algorithm performs better, as can be seen from
the figure.
Figure 3 shows the SUs' satisfaction ratio of the improved genetic algorithm, the genetic
algorithm, and the greedy algorithm. From the figure, we can see that a higher SUs'
satisfaction ratio is obtained by the improved genetic algorithm; in other words, more SUs
can obtain spectrum.
5 Conclusion
In this paper, we proposed a truthful spectrum auction between one PU and multiple SUs.
The auction process is modelled as a knapsack problem, and a genetic algorithm is adopted
to solve it. In order to improve the efficiency of the genetic algorithm, we improved it in
four aspects. With these improvements, all the winners can be determined with a higher
income. We designed a payment rule to ensure the truthfulness of the auction, and a simple
proof of its truthfulness and economic robustness has been given. Numerical results show
that the auction algorithm performs well in solving the winner determination problem.
There are many aspects of our work that can be improved, such as the hidden-node problem
mentioned in [21]. In future work, a more comprehensive consideration should be
incorporated into the auction framework to improve its practicability.
References
1. Haykin, S.: Cognitive radio: brain-empowered wireless communications. IEEE J. Sel.
Areas Commun. 23(2), 201–220 (2005)
2. Liang, W., Xin, S., Hanzo, L.: Cooperative communication between cognitive and primary
users. Inst. Eng. Technol. 7, 1982–1992 (2012)
3. Zhai, C., Zhang, W., Mao, G.: Cooperative spectrum sharing between cellular and ad-hoc
networks. IEEE Trans. Wirel. Commun. (2014)
4. Ng, S.X., Feng, J., Liang, W., Hanzo, L.: Pragmatic distributed algorithm for spectral access
in cooperative cognitive radio networks. IEEE Trans. Commun. (2014)
5. Wang, R., Ji, H., Li, X.: A novel multi-relay selection and power allocation scheme for
cooperation in cognitive radio ad hoc networks based on principle-agent game. Inf.
Commun. Technol. (2013)
6. Song, M., Xin, C., Zhao, Y., Cheng, X.: Dynamic spectrum access: from cognitive radio to
network radio. IEEE Wirel. Commun. 19(1), 23–29 (2012)
7. Xu, H., Jin, J., Li, B.: A secondary market for spectrum. IEEE Infocom (2010)
8. Jing, T., Zhang, F., Cheng, W., Huo, Y., Cheng, X.: Online auction based relay selection for
cooperative communications in CR networks. In: Cai, Z., Wang, C., Cheng, S., Wang, H.,
Gao, H. (eds.) WASA 2014. LNCS, vol. 8491, pp. 482–493. Springer, Heidelberg (2014)
9. Ahmadi, H., Chew, Y.H., Reyhani, N., Chai, C.C., DaSilva, L.A.: Learning solutions for
auction-based dynamic spectrum access in multicarrier systems. Comput. Netw. 60–73
(2014)
10. Zhong, W., Xu, Y., Wang, J., Li, D., Tianfield, H.: Adaptive mechanism design and game
theoretic analysis of auction-driven dynamic spectrum access in cognitive radio networks.
EURASIP J. Wirel. Commun. Networking (2014)
11. Lin, P., Zhang, Q.: Dynamic spectrum sharing with multiple primary and secondary users.
IEEE Trans. Veh. Technol. (2011)
12. Wang, Q., Ye, B., Lu, S., Gao, S.: A truthful QoS-aware spectrum auction with spatial reuse
for large-scale networks. IEEE Trans. Parallel Distrib. Syst. 25(10), 2499–2508 (2014)
13. Alavijeh, M.A., Maham, B., Han, Z., Nader-Esfahani, S.: Efficient anti-jamming truthful
spectrum auction among secondary users in cognitive radio networks. In: IEEE
ICC-Cognitive Radio and Networks Symposium (2013)
14. Yang, D., Xue, G., Zhang, X.: Truthful group buying-based auction design for cognitive
radio networks. In: IEEE ICC-Mobile and Wireless Networking Symposium (2014)
15. Wang, X., Sun, G., Yin, J., Wang, Y., Tian, X., Wang, X.: Near-optimal spectrum allocation
for cognitive radio: a frequency-time auction perspective. In: GLOBECOM-Wireless
Communication Symposium (2012)
16. Mwangoka, J.W., Marques, P., Rodriguez, J.: Broker based secondary spectrum trading. In:
6th International ICST Conference on Cognitive Radio Oriented Wireless Networks and
Communications (2011)
17. Yi, C., Cai, J.: Combinatorial spectrum auction with multiple heterogeneous sellers in
cognitive radio networks. In: IEEE ICC-Cognitive Radio and Networks Symposium (2014)
18. Gao, L., Xu, Y., Wang, X.: MAP: Multiauctioneer progressive auction for dynamic
spectrum access. IEEE Trans. Mob. Comput. 10(8), 1144–1161 (2011)
19. Mochon, A., Saez, Y., Isasi, P.: Testing bidding strategies in the clock-proxy auction for
selling radio spectrum: a genetic algorithm approach. In: IEEE Evolutionary Computation
(2009)
20. Sachdeva, C., Goel, S.: An improved approach for solving 0/1 Knapsack Problem in
polynomial time using genetic algorithms. In: IEEE International Conference on Recent
Advances and Innovations in Engineering (2014)
21. Liu, L., Li, Z., Zhou, C.: Backpropagation-based cooperative location of primary user for
avoiding hidden-node problem in cognitive networks. In: International Journal of Digital
Multimedia Broadcasting, Hindawi Publishing Corporation (2010)
The Design and Implementation of a Dynamic
Verification System of Z
1 Introduction
of the Z model. These two categories of analysis methods are both unable to automat-
ically verify the temporal properties of a given Z model.
Model checking [12], or property checking, is a technology dealing with the fol-
lowing problem: given a model of a state transition system, exhaustively and auto-
matically check whether this model meets a given formula. Usually, the formula
describes temporal properties such as safety and liveness [13]. Because of its con-
ciseness and efficiency, model checking technology has been widely adopted to verify
temporal properties of hardware and software systems for over three decades.
Because Z lacks the ability to describe the temporal behavior of a system, building a
proper state transition system is the primary step in studying model checking methods
for Z. A state transition system based on FA (finite automata) can describe the run-time
behavior of software, but it is insufficient for describing data constraints compared with
Z's syntax based on set theory and first-order logic. Some current studies on model
checking Z are based on transforming Z into an intermediate language that can be model
checked by existing tools [14]. Other related works use new structures such as ZIA
[15, 16], which is a combination of Z notation and interface automata targeted at
component-based systems. By combining Z and state transition structures, we can
establish a more comprehensive system model and study its model checking method
accordingly. However, existing research has not implemented automatic model
transformation between Z and those hybrid models, which is insufficient for industrial
usage.
In this paper, we design and implement a prototype system, ZDVS. First, we define a
formal model ZA (Z-Automata) combining Z and FA (finite automata). A generation
algorithm from the basic structures of Z to ZA is studied to enhance the practical usability
of our hybrid model. Furthermore, a model checking algorithm, ZAMC, is proposed to
automatically verify the temporal/data constraints within the structure and behavior
specified by ZA. Finally, a case study is used to illustrate the correctness and feasibility
of ZDVS.
2 ZA Model
$ZA = (S, S_0, \Sigma, \delta, F, M)$    (2)
3 Model Checking ZA
In this section, we define a set of temporal logic formulas called ZATL (Z-Automata
Temporal Logic) to describe the expected temporal properties of the system. Further
on, a model checking algorithm called ZAMC (Z-Automata Model Checking) is
proposed to perform the verification towards such properties.
The first four formulas represent the composition modes of φ. The last four formulas are
temporal logic formulas used to describe properties of the system.
A□φ denotes that all the states in the system satisfy φ. A◇φ denotes that there is at least
one state satisfying φ on every path of the system. E□φ denotes that there is at least one
path on which all the states satisfy φ. E◇φ denotes that there is at least one state in the
system satisfying φ. These formulas can be used to describe dependability properties of
the system such as liveness and safety.
Step 2.1. If the initial state s0 satisfies φ, the system satisfies the formula. The verification
process ends and outputs true;
Step 2.2. Otherwise, put s0 into the queue Q;
Step 2.3. While Q is not empty, take a state s from Q;
Step 2.4. If s does not satisfy φ, update the current search path. If s ∈ F, we have found a
counterexample, so the verification ends with false;
Step 2.5. If s satisfies φ, put all unprocessed successor states of s into Q;
Step 2.6. Go to Step 2.3 until Q is empty;
Step 2.7. If a counterexample was found, output the counterexample according to the
search path; otherwise, output true.
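A minimal Python sketch of this check under one reading of Steps 2.1–2.7: successors are
expanded for states that have not yet satisfied φ, since a complete path on which φ never
holds (detected through the F-membership test of Step 2.4) is exactly a counterexample to
A◇φ. The names check_af, succ, in_f, and phi are illustrative, not the system's API.

    from collections import deque

    def check_af(s0, succ, in_f, phi):
        # Returns (True, None) if A<>phi holds, else (False, counterexample path).
        if phi(s0):                          # Step 2.1
            return True, None
        parent = {s0: None}
        queue = deque([s0])                  # Step 2.2
        while queue:                         # Steps 2.3 and 2.6
            s = queue.popleft()
            if phi(s):
                continue                     # this branch already satisfies phi
            if in_f(s):                      # Step 2.4: a complete path never satisfying phi
                path = []
                while s is not None:
                    path.append(s)
                    s = parent[s]
                return False, list(reversed(path))
            for t in succ(s):                # enqueue unprocessed successor states
                if t not in parent:
                    parent[t] = s
                    queue.append(t)
        return True, None                    # Step 2.7: no counterexample found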
Based on the ZA model and the ZAMC algorithm, we implement a prototype system called
ZDVS (Z Dynamic Verification System). In this section, we present its framework and
procedure. A case study is used to illustrate the correctness and effectiveness of our
method.
(Listing: the ZAMC algorithm. The pseudocode listing is not recoverable from the source;
it takes the ZA model and a ZATL formula φ as input, handles the four cases A□φ, A◇φ,
E□φ, and E◇φ, explores the state space with a queue while updating the check path, and
returns true or a counterexample reconstructed from the check path.)
(Figure: ZDVS model generation. The Z notation parser translates the input Z specification,
including operation schemas such as op1 and op2, into the ZAstatic model, and the ZA model
generation algorithm then produces the ZA model with its states and transitions.)
(Fig. 5. Verification result: ZA |= Φ? false, with counterexample state0:(0,0) -> state2:(0,1)
-> state4:(0,2) -> state6:(0,3) -> state10:(0,5) -> state11:(1,5) -> state12:(2,5) ->
state13:(3,5) -> state14:(4,5).)
The case study contains 4 schemas as shown in Fig. 4, including a state schema and
3 operation schemas. The corresponding ZA model generated by ZDVS contains 30
states and 3 actions.
(Fig. 4. Case-study schemas: the state schema sysState declares var1 and var2 with
0 ≤ var1 ≤ 4 and 0 ≤ var2 ≤ 5; the operation schemas decVar1Op (var1' = var1 − 1),
incVar1Op (var1' = var1 + 1), and incVar2Op (var2' = var2 + 1) modify the state.)
We use four temporal logic formulas as inputs to the model checking process, listed as
follows.
A□(var1 ≥ 3) denotes that var1 in all states of the system is equal to or greater than 3.
A◇(var1 > 4) denotes that on every path of the system there is at least one state in which
var1 is greater than 4.
E□(var1 ≥ var2) denotes that there is at least one path on which all states satisfy
var1 ≥ var2.
E◇(var1 ≥ var2 && var2 > 2) denotes that there is at least one state in the system
satisfying var1 ≥ var2 && var2 > 2.
Figure 5 shows the verification result of the formula A◇(var1 > 4). Since the result is
false, the system outputs a counterexample.
The case study shows that the proposed ZA model can enhance the descriptive power by
combining Z notation and FA. The prototype system ZDVS is able to correctly parse the
input Z specification and generate the corresponding ZA model, accept and analyze the
input ZATL formulas, and verify the system's temporal properties automatically and
effectively.
In order to verify the temporal properties of Z models, this paper designs and implements
a prototype system, ZDVS, to perform the model transformation and model
checking on a proposed hybrid software model ZA. Our main contributions are as
follows:
1. A formal software model called ZA is defined by combining Z notation and FA. The
proposed ZA model can specify not only static structure and operation specifications,
but also temporal constraints of the targeted system. A generation algorithm is
designed to build the ZA model from ZAstatic, a simplified structure of the Z specification.
References
1. ISO/IEC 13568:2002: Information technology – Z formal specification notation – Syntax,
type system and semantics. ISO (International Organization for Standardization), Geneva,
Switzerland (2002)
2. Spivey, J.M.: The Z Notation: A Reference Manual. International Series in Computer
Science. Prentice-Hall, New York (1992)
3. Wordsworth, J.B.: Software Development with Z: A Practical Approach to Formal Methods
in Software Engineering. Addison-Wesley Longman Publishing Co. Inc, Reading (1992)
4. Ince, D.C., Ince, D.: Introduction to Discrete Mathematics, Formal System Specification,
and Z. Oxford University Press, Oxford (1993)
5. Abrial, J.-R., Schuman, S.A., Meyer, B.: Specification language. In: McKeag, R.M.,
Macnaghten, A.M. (eds.) On the Construction of Programs: An Advanced Course.
Cambridge University Press, Cambridge (1980)
6. Z/EVES Homepage (2015). http://z-eves.updatestar.com/
7. ProofPower Homepage (2015). http://www.lemma-one.com/ProofPower/index/
8. Martin, A.P.: Relating Z and first-order logic. Formal Aspects Comput. 12(3), 199–209
(2000)
9. ProB Homepage (2015). http://www.stups.uni-duesseldorf.de/ProB/index.php5/Main_Page
10. Jaza Homepage (2005). http://www.cs.waikato.ac.nz/*marku/jaza/
11. Zlive Homepage (2015). http://czt.sourceforge.net/zlive/
12. Clarke, E.M., Grumberg, O., Peled, D.: Model Checking. MIT Press, Cambridge (1999)
13. Alpern, B., Schneider, F.B.: Recognizing safety and liveness. Distrib. Comput. 2(3), 117–126
(1987)
14. Smith, G.P., Wildman, L.: Model checking Z specifications using SAL. In: Treharne, H.,
King, S., Henson, M.C., Schneider, Steve (eds.) ZB 2005. LNCS, vol. 3455, pp. 85–103.
Springer, Heidelberg (2005)
15. Cao, Z., Wang, H.: Extending interface automata with Z notation. In: Arbab, F., Sirjani, M.
(eds.) FSEN 2011. LNCS, vol. 7141, pp. 359–367. Springer, Heidelberg (2012)
16. Cao, Z.: Temporal logics and model checking algorithms for ZIAs. In: 2010 2nd
International Conference on Proceedings of the Software Engineering and Data Mining
(SEDM). IEEE (2010)
Maximizing Positive Influence in Signed Social
Networks
1 Introduction
Recently, online social networks have developed rapidly and are becoming popular
platforms for interaction and communication in our daily life. Due to their information
spreading capability, social networks are widely applied to viral marketing. In viral
marketing, merchants usually select a certain number of influential users, and other
people can be influenced by them because of the word-of-mouth effect in the social
network. The Influence Maximization (IM) problem is how to find those influential
users so as to maximize the number of influenced users in a social network.
Usually, a social network is denoted as a graph G = (V, E); each node in V represents a
user and each edge in E represents a relationship between users. The IM problem aims to
find K nodes as a seed set so that the number of nodes reached by the influence spread is
maximized under a propagation model. The IC (Independent Cascade) model and the LT
(Linear Threshold) model are two classical propagation models, proposed by Kempe et al.
[1]. The IM problem is formulated as a discrete optimization problem [1] and has been
proved to be NP-hard. A greedy algorithm can achieve an approximately optimal solution
within a (1 − 1/e) ratio because the influence function is monotone and submodular.
Nevertheless, the greedy algorithm is inefficient and has serious scalability problems. To
improve efficiency and scalability, many improved greedy or heuristic algorithms have
been proposed [6–9, 13, 19, 20].
Although much work has been done on the IM problem, most of it only considers whether
the nodes are activated or not, paying no attention to the opinions of the nodes. However,
as users can hold positive or negative opinions, activated users can also be activated with
a positive or a negative opinion, which makes a difference in viral marketing. Thus, the
opinions of nodes should be considered. Meanwhile, as the relationships between users
differ, the impact a user's opinion has on others also differs. Therefore, the relationship
between nodes should be measured to determine whether an activated node will hold a
positive or negative opinion.
In view of the above problems, we define the PIM problem in signed social networks
and extend the influence propagation model. A signed social network is a network in
which edge weights denote the relationships between nodes. In signed social networks,
the relationships between nodes can be positive, denoting trust, or negative, denoting
distrust [13]. In a signed social network, if a node trusts another node, it will hold the
same opinion as that node, and if it distrusts another node, it will hold the opposite
opinion.
In this paper, our contributions are as follows:
• We propose a Linear Threshold with Attitude (LT-A) model for the PIM problem in
signed social networks, defining an attitude function to calculate the attitude weight of
each node.
• We demonstrate that the PIM problem under the LT-A model is NP-hard. We also prove
that the influence function of the PIM problem under the LT-A model is monotone and
submodular, which allows a greedy algorithm to obtain an approximately optimal
solution within a (1 − 1/e) ratio.
• We conduct experiments on two real signed social network datasets, Epinions and
Slashdot.
The rest of this paper is organized as follows. In Sect. 2, we discuss related work. In
Sect. 3, we propose our influence propagation model, LT-A, and prove that the influence
function is monotone and submodular. In Sect. 4, we first give the definition of the PIM
problem in signed social networks and then prove that it is NP-hard; after that, we give
the greedy algorithm. In Sect. 5, we present our experiments and analyze the results.
In Sect. 6, we give our conclusions.
2 Related Work
The influence maximization problem was first studied by Domingos and Richardson
[2, 3], who model the problem within a probabilistic framework and use Markov random
fields to solve it. Kempe et al. [1] further formulate the problem as a discrete optimization
problem. They prove that the problem is NP-hard and propose two basic models: the IC
model and the LT model. They also present a greedy algorithm to obtain an approximately
optimal solution, but the efficiency of the algorithm is low and its scalability is poor. To
solve these problems, many effective algorithms have been proposed.
Leskovec et al. [4] put forward the Cost Effective Lazy Forward (CELF) algorithm,
which shortens the execution time by a factor of up to 700. Goyal et al. [5] further improve
the original CELF algorithm to CELF++. In addition, [6–9] also shorten the execution
time by improving the efficiency of the greedy algorithm or by proposing heuristic
algorithms.
However, most works on the IM problem assume that all users hold a positive attitude,
but in reality some users may hold a negative attitude. For example, some people will buy
the iPhone 6 because they like its appearance, while others hold a negative attitude
because the iPhone 6 was found vulnerable to bending, and they may not buy it. If
negative relationships are ignored, the result of influence maximization will be
over-estimated. Therefore, negative attitudes should also be taken into account in the PIM
problem. Chen et al. [10] first extend the IC model to IC-N, adding a parameter called the
quality factor to reflect the negative opinions of users. There are several deficiencies in
their model. First, in their model, an activated node holds the same opinion as the node
that activates it. But in real life, a user may not trust another user, so when a node is
activated by other nodes, it may not hold the same opinion as the node activating it.
Second, in [10], each node is assumed to hold the same opinion all the time, but users may
change their opinions when they are affected by others. Third, [10] ignores individual
preference. Actually, when considering individual preference, the degree to which people
are affected by others differs, as people may have their own opinions towards a product.
In [11], personal preference is taken into account. The model in [11] is based on the
assumption that the opinions users can hold are positive, neutral, and negative. They
propose a two-phase influence propagation model, the OC (opinion-based cascading)
model.
In [12], considering the opinion modification problem, an extended LT model with
instant opinions is proposed. Besides, to express the different influence between users, a
trust threshold is defined in the model to judge whether a node follows or opposes the
attitude of its neighbors. However, the trust threshold they define is the same for every
edge in the graph, while in reality different neighbors exert different influence.
Considering the above problems, Li et al. study the IM problem in signed social networks
[13]. In signed social networks, instead of comparing the influence weight with a trust
threshold to determine whether two nodes hold a trust or distrust relationship, there is a
weight on each edge, 1 or −1: 1 denotes that one node trusts the other node and −1 denotes
distrust [14]. Using signed social networks, Li et al. [13] propose an extended model based
on the IC model, the Polarity-related Independent Cascade (IC-P) model, and consider the
polarity-related influence maximization problem. In this paper, we focus on the positive
influence maximization problem, as in viral marketing positive influence maximization is
more important.
In this section, we describe how to formulate a signed social network as a graph, and then
propose the Linear Threshold model with Attitude (LT-A).
The value of ρ(u, v) is 1 or −1: ρ(u, v) = 1 denotes that node v will follow the attitude of
node u, and ρ(u, v) = −1 denotes that node v will oppose the attitude of node u. θ_v denotes
the threshold, θ_v ∈ [0, 1], and η_v denotes the attitude weight, η_v ∈ [−1, 1].
In the propagation process, v can be activated if $\sum_{u \in A_v^t} \omega(u, v) \ge \theta_v$; η can be
calculated by

$\eta_v^t = \sum_{u \in A_v^t} \eta_u \, \rho(u, v) \, \omega(v, u)$    (2)
The attitude weight is updated once the node is activated and is never changed afterwards.
Intuitively, the nodes in the LT-A model have three states: active with a positive attitude,
active with a negative attitude, and inactive.
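The following single-round Python sketch illustrates the activation and attitude rules above.
It assumes the graph is given as a dictionary mapping each node to its in-neighbors, with
omega, rho, theta, and eta stored in dictionaries; for simplicity the sketch uses ω(u, v) in both
the activation test and the attitude update of Eq. (2). All names are illustrative.

    def lt_a_step(in_neighbors, omega, rho, theta, eta, active):
        # One LT-A round: v activates when the influence weight from its active
        # in-neighbors reaches theta[v]; its attitude weight then follows Eq. (2)
        # and is fixed from that point on.
        newly_active = {}
        for v, neigh in in_neighbors.items():
            if v in active:
                continue
            a_v = [u for u in neigh if u in active]               # A_v^t
            if a_v and sum(omega[(u, v)] for u in a_v) >= theta[v]:
                newly_active[v] = sum(eta[u] * rho[(u, v)] * omega[(u, v)] for u in a_v)
        eta.update(newly_active)
        active.update(newly_active.keys())
        return newly_active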
Let c_G(u, v) denote the set of shortest paths from u to v, and let Y_u denote the set of all
neighbors influenced by node u.
Let c_{G'}(S, v) denote the shortest path from the nodes in set S to node v in subgraph G',
and let d^{G'}(S, v) denote the length of this shortest path. Then, for S ⊆ T, we have
d^{G'}(S, v) ≥ d^{G'}(T, v). If d^{G'}(u, v) ≥ d^{G'}(S, v), we have
δ_{G'}(S ∪ {u}) − δ_{G'}(S) = δ_{G'}(T ∪ {u}) − δ_{G'}(T). If d^{G'}(u, v) ≤ d^{G'}(T, v),
we have δ_{G'}(S ∪ {u}) − δ_{G'}(S) = δ_{G'}(T ∪ {u}) − δ_{G'}(S) ≥ δ_{G'}(T ∪ {u}) − δ_{G'}(T),
as δ_{G'}(S) is monotonically decreasing. If d^{G'}(T, v) < d^{G'}(u, v) < d^{G'}(S, v), we get
δ_{G'}(S ∪ {u}) − δ_{G'}(S) > 0 = δ_{G'}(T ∪ {u}) − δ_{G'}(T). Therefore, δ_G(S) is
submodular in S.
δ(·) denotes the influence spread function. Given a seed set S, the value of δ(S) represents
the expected number of nodes that are activated with a positive attitude by S under the
LT-A propagation model.
The PIM problem is to find a seed node set that maximizes the positive influence. It can
be formalized as

$S^* = \arg\max_{S \subseteq V,\, |S| = K} \delta(S)$

When solving the PIM problem, it is assumed that when a node is selected as a seed node,
the value of its attitude weight η is 1. Moreover, we only choose positive nodes as the
seed set.
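A hedged sketch of the greedy selection for this problem: it assumes a routine
simulate_spread(S) that estimates δ(S) by Monte Carlo simulation of the LT-A model and
returns the expected number of positively activated nodes; monotonicity and submodularity
of δ give the usual (1 − 1/e) approximation guarantee. The names are illustrative.

    def greedy_pim(candidates, k, simulate_spread):
        # Repeatedly add the node with the largest marginal gain in positive spread.
        seeds, current = set(), 0.0
        for _ in range(k):
            best_node, best_gain = None, float("-inf")
            for v in candidates:
                if v in seeds:
                    continue
                gain = simulate_spread(seeds | {v}) - current
                if gain > best_gain:
                    best_node, best_gain = v, gain
            if best_node is None:
                break
            seeds.add(best_node)
            current += best_gain
        return seeds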
Theorem 3. For any social network graph G = (V, E, ω, ρ), the problem of finding the
seed set S that maximizes δ(S) is NP-hard.
Proof 3. To prove this, a special instance under the LT-A model is considered. An
unsigned social network can be regarded as a special signed social network in which the
relationship weight ρ of every edge is 1 and only positive influence is propagated, so the
LT-A model reduces to the LT model. As the IM problem in social networks is NP-hard,
the PIM problem under the LT-A model is also NP-hard.
6 Experiments
6.1 Datasets
We use two datasets: Epinions and Slashdot. In both datasets, the relationships between
users are explicitly labeled. The two network datasets can be obtained from the Stanford
Large Network Dataset Collection [21, 22]. The statistics of the two datasets are shown in
Table 2.
• Epinions. An online product review site where users decide to trust or distrust others
based on their product reviews and ratings.
• Slashdot. A technology-related news website where users can submit and evaluate
current, primarily technology-oriented, news. Friend and foe links between users are
contained in this network.
6.2 Algorithm
We conduct experiments on the above-mentioned two datasets using our LT-A greedy
algorithm and other algorithms, and then evaluate the experimental results. The other
algorithms are the greedy algorithm based on the LT model [4] (LT Greedy algorithm), the
Positive Out-Degree algorithm [13], and the Random algorithm [1]. The LT Greedy
algorithm is a general greedy algorithm with CELF optimization on the LT model. The
Positive Out-Degree algorithm selects the top-k nodes with the largest positive out-degree
as the seed set. The Random algorithm randomly selects k nodes as the seed set. The two
heuristic algorithms are based on our new model, LT-A.
The parameters in the experiments are set as follows. For each node, the value of the
threshold θ is a random value between 0 and 1; as stated in Sect. 3, the attitude η of the
seed nodes is set to 1, and the attitudes of the remaining nodes are computed by formula
(2). For each edge, the value of the influence weight ω is also a random value between 0
and 1, which must satisfy ω(u, v) ∈ [0, 1] and criterion (1). The value of the relationship
weight ρ is obtained from the real-world datasets, Epinions and Slashdot. As too large a
number of seed nodes leads to long execution times and previous works usually set
k = 50, we set the number of seed nodes k to 50 and compare the positive influence spread
for different sizes of the seed set.
7 Conclusions
References
1. Kempe, D., Kleinberg, J., Kleinber, J.: Maximizing the spread of influence through a social
network. In: Knowledge Discovery and Data Mining (KDD), pp. 137–146 (2003)
2. Domingos, P., Richardson, M.: Mining the network value of customers. In: Knowledge
Discovery and Data Mining (KDD), pp. 57–66 (2001)
3. Richardson, M., Domingos, P.: Mining knowledge-sharing sites for viral marketing. In:
Knowledge Discovery and Data Mining (KDD), pp. 61–70 (2002)
4. Leskovec, J., Krause, A., Guestrin, C., Faloutsos, C., VanBriesen, J., Glance, N.:
Cost-effective outbreak detection in networks. In: Knowledge Discovery and Data Mining
(KDD), pp. 420–429 (2007)
5. Goyal, A., Lu, W., Lakshmanan, L.V.S.: CELF++: optimizing the greedy algorithm for
influence maximization in social networks. In: International Conference Companion on
World Wide Web (WWW), pp. 47–48 (2011)
6. Chen, W., Wang, C., Wang, Y.: Scalable influence maximization for prevalent viral
marketing in large scale social networks. In: Knowledge Discovery and Data Mining (KDD),
pp. 1029–1038 (2010)
7. Chen, W., Yuan, Y., Zhang, L.: Scalable influence maximization in social networks under
the linear threshold model. In: International Conference on Data Mining (ICDM), pp. 88–97
(2010)
8. Lu, W., Lakshmanan, L.V.S.: Simpath: an efficient algorithm for influence maximization
under the linear threshold model. In: International Conference on Data Mining (ICDM),
pp. 211–220 (2011)
9. Lu, Z., Fan, L., Wu, W., Thuraisingham, B., Yang, K.: Efficient influence spread estimation
for influence maximization under the linear threshold model. Comput. Soc. Netw. 1(1), 1–19
(2014)
10. Chen, W., Collins, A., Cummings, R., Ke, T., Liu, Z., Rincon, D., Yuan, Y.: Influence
maximization in social networks when negative opinions may emerge and propagate. In:
SDM vol. 11, pp. 379–390 (2011)
11. Zhang, H., Dinh, T.N., Thai, M.T.: Maximizing the spread of positive influence in online
social networks. In: Distributed Computing Systems (ICDCS), pp. 317–326. IEEE Press
(2013)
12. Li, S., Zhu, Y., Li, D., Kim, D., Ma, H., Huang, H.: Influence maximization in social
networks with user attitude modification. In: International Conference on Communications
(ICC), pp. 3913–3918. IEEE Press (2014)
13. Li, D., Xu, Z.M., Chakraborty, N., Gupta, A., Sycara, K., Li, S.: Polarity related influence
maximization in signed social networks. PLoS ONE 9(7), e102199 (2014)
14. Hassan, A., Abu-Jbara, A., Radev, D.: Extracting signed social networks from text. In:
Association for Computational Linguistics Workshop Proceedings of TextGraphs-7 on
Graph-based Methods for Natural Language Processing, pp. 6–14 (2012)
15. Rozin, P., Royzman, E.B.: Negativity bias, negativity dominance, and contagion. Pers. Soc.
Psychol. Rev. 5(4), 296–320 (2001)
16. Baumeister, R.F., Bratslavsky, E., Finkenauer, C.: Bad is stronger than good. Rev. Gen.
Psychol. 5(4), 323–370 (2001)
17. Peeters, G., Czapinski, J.: Positive-negative asymmetry in evaluations: the distinction
between affective and informational negativity effects. Eur. Rev. Soc. Psychol. 1, 33–60
(1990)
18. Taylor, S.E.: Asymmetrical effects of positive and negative events: the mobilization-
minimization hypothesis. Psychol. Bull. 110(1), 67–85 (1991)
19. Li, Y., Chen, W., Wang, Y., Zhang, Z.L.: Influence diffusion dynamics and influence
maximization in social networks with friend and foe relationships. In: International
Conference on Web Search and Data Mining (WSDM), pp. 657–666 (2013)
20. Bhagat, S., Goyal, A., Lakshmanan, L.V.: Maximizing product adoption in social networks.
In: International Conference on Web Search and Data Mining (WSDM), pp. 603–612 (2012)
21. Epinions social network. http://snap.stanford.edu/data/soc-Epinions1.html
22. Slashdot social network. http://snap.stanford.edu/data/soc-Slashdot0811.html
OpenFlow-Based Load Balancing for Wireless
Mesh Network
1 Introduction
Wireless mesh networks are intended for the last mile broadband Internet access to
extend or enhance Internet connectivity for mobile clients, which can provide high
network throughput, as well as optimal load balancing. In a typical mesh network,
mesh routers collect information from local mesh nodes, and the routing algorithm
decides the forwarding paths according to these information. Traditional wireless mesh
routing algorithms are usually distributed, so that the solution of the network is usually
deployed on each mesh nodes. It is difficult to increase the routing algorithm, and then
achieve the higher network throughput.
In wireless mesh networks, load balancing is critical. Load imbalance makes some
nodes become bottleneck nodes. Because the forwarding traffic of these nodes are too
much, the network performance will decline throughout the network. How to build
optimal mesh networks with load balancing has been studied theoretically. A variety of
routing algorithms have been put forward to solve the load balancing problem in mesh
networks. However, these algorithms can’t dynamically adapt to current network
topology and dataflow changes, avoid the bottleneck node, and select the most stable
link to establish a route.
2 Related Work
There has been significant prior work on load balancing strategies for traditional wireless
mesh networks. Most prior works focus on distributed algorithms, where the mesh nodes
communicate only with their neighborhood [3, 4]. Routing protocols for mesh networks
can generally be divided into proactive routing, reactive routing, and hybrid routing
strategies [2]. Nevertheless, most of these protocols do not provide a load balancing
strategy. With the advent of OpenFlow, a centralized algorithm in WMNs becomes
possible.
In [5], the combination of OpenFlow and WMNs was proposed for the first time. In that
paper, OpenFlow and a distributed routing protocol (OLSR) are combined in a Wireless
Mesh Software Defined Network (wmSDN). Load balancing using both wireless mesh
network protocols and OpenFlow is discussed in [6]. The experiments in that paper
demonstrate the improved performance of OpenFlow over traditional mesh routing
protocols and show that the OpenFlow controller is able to make centralized decisions on
how to optimally route traffic so that the computational burden at each node is minimized.
However, the study does not consider the link quality between the mesh nodes, and the
topology of the network does not contain gateway nodes. In [9], the authors proposed a
prototype mesh infrastructure in which flows from a source node can take multiple paths
through the network based on OpenFlow to solve the load balancing problem for wireless
mesh networks. However, that study does not adequately consider the multiple radio
interfaces of the mesh nodes. Therefore, considering the characteristics of wireless mesh
networks, we propose a solution for multi-interface wireless mesh networks. We have also
implemented an OpenFlow-enabled mesh network testbed in which each node has two
radio interfaces.
If a WMR loses communication with the controller, new flows fail to be transmitted; if the
SDN controller fails, the whole network breaks down. Furthermore, centralized control of
a WMN requires transferring a considerable amount of status information and
configuration commands between the WMN nodes and the centralized control entity,
which causes longer delays. To deploy an appropriate SDN strategy, we must make better
use of SDN technologies.
3.3 OpenFlow
OpenFlow was originally designed to provide a real experimental platform for campus
network researchers designing innovative network architectures; McKeown et al. [10]
then started promoting the SDN concept, which aroused wide attention in academia and
industry. OpenFlow is a new switching protocol based on the concept of software-defined
networking (SDN).
OpenFlow-enabled switches move packet-forwarding intelligence to the OpenFlow
controller while keeping the switches simple. In legacy switch architectures, there is a
control domain and a forwarding domain. As there is no control domain residing at an
OpenFlow-based switch, the forwarding domain can be kept simple, and the switches
perform the forwarding function based on flow tables. The functionality of the control
domain is moved to the control network, i.e., one or more OpenFlow controllers. The
controller is connected to every switch by a secure channel, using the OpenFlow protocol.
OpenFlow makes packet forwarding and routing more intelligent than legacy routing
solutions, making it possible to develop more complex routing protocols that further
improve network performance. In our implementation, we use OpenFlow 1.3 [7] to
construct our OpenFlow-based wireless mesh nodes.
Our testbed is a small-scale wireless mesh network consisting of at least four wireless
routers placed as depicted in Fig. 2. The experiments were conducted inside our lab
building.
All the wireless routers in our testbed are PC Engines ALIX system boards (alix2d2);
their features are shown in Table 1. We use this Linux-based x86 platform for its
extensible memory, radio adaptability via miniPCI cards, low power consumption, and
low cost. The wireless routers have two Ethernet channels and two miniPCI slots which
can take 802.11 a/n or 802.11 b/g wireless cards as wireless interfaces. The firmware of
the wireless routers was replaced with a custom OpenWRT firmware [11]; OpenWRT is a
well-known embedded Linux distribution.
For the firmware we used OpenWRT trunk r43753, which is based on Linux kernel
version 3.14.26. This version of OpenWRT supports Open vSwitch 2.3. In these
experiments, we use a custom POX controller (a controller based on NOX for rapid
deployment of SDNs using Python) as our OpenFlow controller. Because the number of
wireless radio interfaces is limited, we use an out-of-band control network with a wired
connection.
In this section we present our solution for load balancing in an OpenFlow-based wireless
mesh network and the results of the experiments implemented on our testbed. We use the
network measurement tool iPerf for throughput and bandwidth measurements.
All kinds of data servers are deployed in the Internet, from which the mesh network
requests data. Our goal in this section is to set up a connection between node a and node b.
Assume node b already has a connection to node g. Now consider a scenario in which a
new node a joins the network. It needs to find its next-hop neighbor so that it can
communicate with the Internet. In this scenario, node b is the neighbor, and a wireless link
using channel 1 represents the added flow path.
Experimental Procedure. To implement this scenario, traditional solutions require
relatively high local computation capability. In this section we show how to add the data
path using the OpenFlow controller. As shown in Table 2, the controller addresses this by
sending flow tables and configuration instructions to the mesh nodes, and no further action
is required at the local mesh routers. This setup serves as the foundation for the flow
redirection.
redirection.
When node a want to connect with node b, the controller will firstly establish an
available wireless link by sending configuration instructions to them, setting the same
channel and mesh_id. Thereafter, for node a, we identify the virtual OpenFlow switch
as id_node_a, and the ingress port where data packets origination from node a labeled
as port_node_a. When the OpenFlow switch receives data flow matching flow rules,
the header of the packet will be manipulated in case of following the flow actions. In
our scenario, the destination of node a’s packets is node b. Hence, packets form node
a must modify their destination IP and MAC addresses. The set_dst_ip and set_dst_-
mac fields are used to rewrite packet headers, so the node_b_ip and node_b_mac are
the IP and MAC addresses of node b, respectively. The modified packets must be
output through the wireless radio interface, defined as node_a_port. The node b’s
configuration is similar to node a, except the destination. According to the appropriate
rules, the gateway will forward the packets which the IP of the destination is out of this
network segment.
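For illustration, a minimal POX-style sketch of pushing the forwarding rule of Table 2 to
node a's Open vSwitch instance is given below. POX speaks OpenFlow 1.0, so the field and
action names follow pox.openflow.libopenflow_01 rather than the OpenFlow 1.3 wire
protocol mentioned earlier; the helper name and all address and port values are placeholders.

    import pox.openflow.libopenflow_01 as of
    from pox.lib.addresses import IPAddr, EthAddr

    def install_forward_rule(connection, in_port, out_port, new_dst_ip, new_dst_mac):
        # Match IPv4 packets arriving on in_port, rewrite the destination IP/MAC
        # (set-dst-ip / set-dst-mac of Table 2) and output them on the mesh radio port.
        msg = of.ofp_flow_mod()
        msg.match.in_port = in_port
        msg.match.dl_type = 0x0800                  # IPv4, required for nw_dst rewriting
        msg.actions.append(of.ofp_action_nw_addr.set_dst(IPAddr(new_dst_ip)))
        msg.actions.append(of.ofp_action_dl_addr.set_dst(EthAddr(new_dst_mac)))
        msg.actions.append(of.ofp_action_output(port=out_port))
        connection.send(msg)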
The flow tables are pushed by the controller, and after that the data path between node a
and gateway g is established. In this scenario, the host communicates with the Internet
over a two-hop link. The iPerf measurement tool shows a TCP throughput averaging
4.04 Mbit/s between node a and the Internet. The measurement was maintained for 10 min
and repeated five times.
Table 2. Flow tables and configuration instructions (basic setup of the data flow paths)

Configuration instructions:
  Node A: mesh channel: 1; meshid: MeshTrain
  Node B: mesh channel: 1; meshid: MeshTrain
Flow rule:
  Node A: switch: id_node_a; port: port_node_a
  Node B: switch: id_node_b; port: port_node_b
Flow actions (forward):
  Node A: set-dst-ip: node_b_ip; set-dst-mac: node_b_mac; output: node_a_port
  Node B: set-dst-ip: destination_ip; set-dst-mac: destination_mac; output: node_b_port
Fig. 5. Data flow path redirection between links
Fig. 6. Throughput before and after path redirection
Experimental Description. Nodes a, b, and c are mesh nodes, and node g is an Internet
gateway with a connection to the Internet. Assume a link is already established from node
a to gateway g via node b, as described in the last section. The target of this section is to
redirect data flows from node a to node c when node b experiences unexpected conditions
or a new data packet arrives. This is a simple example of load balancing.
Experimental Procedure. The flow tables and configuration instructions for this scenario
are shown in Table 3. When we want to redirect to a new data path, we must remove the
old one. For node a, we remove the flow actions for the a-b link and send a new flow table
for a-c in which the destination IP and MAC addresses are replaced with node c's
(node_c_ip and node_c_mac). Node c modifies the packet headers towards the final
destination.
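A matching sketch of the redirection step, again in POX's OpenFlow 1.0 API with
placeholder values: the old a-b entry is deleted and the a-c rule is installed with the
install_forward_rule helper from the previous sketch.

    import pox.openflow.libopenflow_01 as of

    def remove_rule(connection, in_port):
        # Delete the previously installed entry matching traffic that enters on in_port.
        msg = of.ofp_flow_mod(command=of.OFPFC_DELETE)
        msg.match.in_port = in_port
        connection.send(msg)

    # remove_rule(conn_node_a, port_node_a)
    # install_forward_rule(conn_node_a, port_node_a, node_a_port, node_c_ip, node_c_mac)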
In our experiment, node a sends data to gateway g via node b before the data flow
redirection. Due to the degradation of node b's signal, the system decides to adjust the
data link. The new path offers higher average throughput because it uses the 802.11 a/n
wireless radio card, while the former path uses the 802.11 b/g wireless radio card.
Figure 6 shows the TCP throughput measured by iPerf before and after the redirection. It
can be seen that throughput increases after the flows are redirected to the new path with
more stable wireless channels.
Table 3. Flow tables and configuration instructions (data flow path redirection between links)

Configuration instructions:
  Node A: mesh channel: 36; meshid: MeshTrain
  Node C: mesh channel: 36; meshid: MeshTrain
Flow rules:
  Node A: switch: id_node_a; port: port_node_a
  Node C: switch: id_node_c; port: port_node_c
Flow actions (remove):
  Node A: set-dst-ip: node_b_ip; set-dst-mac: node_b_mac; output: node_a_port
Flow actions (forward):
  Node A: set-dst-ip: node_c_ip; set-dst-mac: node_c_mac; output: node_a_port
  Node C: set-dst-ip: destination_ip; set-dst-mac: destination_mac; output: node_c_port
References
1. Jagadeesan, N.A., Krishnamachari, B: Software-defined networking paradigms in wireless
networks: a survey. ACM Comput. Surv. 47(2), 11 p. (2014). Article 27, doi:10.1145/
2655690
2. Alotaibi, E., Mukherjee, B.: A survey on routing algorithms for wireless ad-hoc and mesh
networks. Comput. Netw. 56(2), 940–965 (2012)
3. Hu, Y., Li, X.-Y., Chen, H.-M., Jia, X.-H.: Distributed call admission protocol for
multi-channel multi-radio wireless networks. In: Global Telecommunications Conference,
2007, GLOBECOM 2007, pp. 2509–2513. IEEE 26–30 November 2007
4. Brzezinski, A., Zussman, G., Modiano, E.: Distributed throughput maximization in wireless
mesh networks via pre-partitioning. IEEE/ACM Trans. Networking 16(6), 1406–1419
(2008)
5. Detti, A., Pisa, C., Salsano, S., Blefari-Melazzi, N.: Wireless mesh software defined
networks (wmSDN). In: 2013 IEEE 9th International Conference on Wireless and Mobile
Computing, Networking and Communications (WiMob), vol. 6983, pp. 89–95. IEEE (2013)
6. Chung, J., Gonzalez, G., Armuelles, I., Robles, T., Alcarria, R., Morales, A.: Characterizing
the multimedia service capacity of wireless mesh networks for rural communities. In: 2012
IEEE 8th International Conference on Wireless and Mobile Computing, Networking and
Communications (WiMob), pp. 628–635. IEEE (2012)
7. Open Networking Foundation. https://www.opennetworking.org/
8. Akyildiz, I.F., Wang, X.: A survey on wireless mesh networks. IEEE Commun. Mag. 43(9),
S23–S30 (2005)
9. Yang, F., Gondi, V., Hallstrom, J.O., Wang, K.C., Eidson, G.: OpenFlow-based load
balancing for wireless mesh infrastructure. 2014 IEEE 11th Consumer Communications and
Networking Conference (CCNC), pp. 444–449. IEEE (2014)
10. Parulkar, G.M., Rexford, J., Turner, J.S., Mckeown, N., Anderson, T., Balakrishnan, H.,
et al.: Openflow: enabling innovation in campus networks. ACM SIGCOMM Comput.
Commun. Rev. 38(2), 69–74 (2008)
11. OpenWrt. https://openwrt.org/
SPEMS: A Stealthy and Practical Execution
Monitoring System Based on VMI
Abstract. Dynamic analysis has been used for decades to trace the execution of programs. However, most approaches need an agent installed inside the execution environment, which is easy to detect and bypass. To solve this problem, we propose SPEMS, a system that uses virtual machine introspection (VMI) to stealthily monitor the execution of programs inside virtual machines. SPEMS integrates and improves multiple open-source software tools. By covering the whole process of sample preparation, execution tracing, and analysis, it can be applied to large-scale program monitoring, malware analysis, and memory forensics. Experimental results show that our system achieves a remarkable performance improvement over former works.
1 Introduction
Although VMI is suitable for stealthy monitoring, existing solutions are either costly in performance or usable only under strictly limited conditions, which makes them impractical for large-scale malware analysis of the kind Cuckoo performs. To make VMI practically usable, we inspected the entire malware analysis process, including virtual environment preparation, execution tracing, and result analysis. We utilized several open-source works and improved them to make them practically usable. Our main contribution is an integrated VMI-based program execution monitoring system, named SPEMS, which includes VM snapshotting, out-of-box file injection, process injection, and execution tracing. Our system can also be used in forensics and intrusion detection.
The rest of the paper is organized as follows: Sect. 2 discusses the development of VMI techniques and related works. The main work and its implementation are discussed in Sect. 3. Experiments and results are discussed in Sect. 4. Finally, a conclusion is drawn in the last section.
2 Related Works
VMI is divided into two kinds in [5], namely in-band VMI and out-of-band VMI. Lares [13] is a typical in-band VMI system designed to protect in-guest security tools and transfer information to the hypervisor. While in-band VMI is easy to implement and has a low performance cost, it is easily detected by malware. We therefore mainly discuss out-of-band VMI, which is also the main trend in VMI development.
LibVMI [8], formerly called XenAccess, is a C library with Python bindings that makes it easy to monitor the low-level details of a running virtual machine by viewing its memory, trapping on hardware events, and accessing the vCPU registers. Works related to LibVMI include VMwall [6], VMI-Honeymon [9], and RTKDSM [10]. They focus on obtaining guest information passively rather than actively triggering the VM to get what is needed, which limits their use in monitoring. Some solutions such as NICKLE [11], SecVisor [12], and Lares [13] deal with protecting the guest OS's code or data structures, but none of them can trace user-space program execution.
Recent works such as SPIDER [14], DRAKVUF [15], Virtuoso [16], and VMST [17, 18] actively inspect a VM's state and are used in VM configuration and intrusion defense. SPIDER is a VMI-based debugging and instrumentation tool that uses instruction-level trapping based on invisible breakpoints and hardware virtualization. SPIDER hides the existence of invisible breakpoints in guest memory by using the Extended Page Table (EPT) to split the code and data views seen by the guest, and it handles invisible breakpoints at the hypervisor level to avoid any unexpected in-guest execution.
None of the above tools or projects is practical for large-scale malware analysis, either for performance reasons or for automation reasons. DRAKVUF uses SPIDER's #BP injection technique to initiate the execution of malware samples and monitor kernel internal functions and heap allocations. It then uses LibVMI to intercept Xen events and map guest memory into the host. Although DRAKVUF reduces resource cost via copy-on-write memory and disks, its CPU cost is still very high, and it is only capable of starting existing malware samples, not of injecting them. What is more, DRAKVUF cannot monitor user-level function calls. To make DRAKVUF practically usable, we improved it by adding support for monitoring the function calls of a specified process and module, including user-level function calls, and we proposed an out-of-band injection method to submit malware samples into the VM.
3 Implementation
The main framework of our system SPEMS is shown in Fig. 1. It consists of three parts: sample preparation, execution tracing, and system call analysis. The sample preparation part prepares the execution environment for malware samples. Considering the increasing number of malware samples, we designed a workflow for snapshot taking, virtual machine cloning, and sample injection to increase reusability and productivity. The execution tracing part is the core of our system; it uses open-source tools such as LibVMI and Rekall to parse the guest OS symbols and make guest memory accessible. Moreover, we improved DRAKVUF in several aspects to reduce the CPU cost and refine the results. The system call analysis part further analyzes the results obtained from execution tracing. The actual analysis method can vary according to demand, so to keep the system general we do not discuss the detailed implementation of this part; readers are referred to our former works [21, 22] for details on using system call sequences to further analyze the behavior of malware.
Tools such as vmtools and xentools are file based, while hot-plug block devices and live modification of the VM disk are volume based. Even though commercial products such as VMware ESX and Citrix XenServer can share files between the host and the VM with tools like vmtool and xentool, this is not available for open-source virtualization software such as KVM and Xen. Besides, it also carries a risk of VM escape attacks, as reported earlier [19]. To avoid the use of a network connection and ensure isolation between the host and the VM, we utilize guestfish [20] to inject malware samples into the VM. Guestfish is a shell and command-line tool that uses libguestfs and exposes all of the functionality of the libguestfs API. Guestfish provides an API to copy files into and out of a VM, which we use to copy samples into the VM.
While it’s easy to take snapshot and recovery VMs for XenServer and ESX, even
KVM provides snapshot mechanism, it is not officially done to Xen. But there have
been a lot of unofficial solutions to this, such as byte-by-byte snapshot using xm and
dd, logical volume manager (LVM) incremental method, rsync based, and ZFS based.
Among them, LVM’s copy-on-write (cow) disk capability and Xen’s native cow
memory interface is very suitable to backup Xen VM state and recovery it in a short
period of time. We implement it with a script as follows. The total time of snapshotting
is less than 10 s of 1 GB memory and 1 GB backend logical volumn VM disk in our
test. And the recovery stage costs less than 5 s. The time cost of injection stage is
related with the sample file size. In our test, it is almost the same speed as local disk
write operation, except that the mounting of guest disk to host costs additional 20 s
more or less. Combined LVM with guestfish, it is able to inject malware samples into
VM disk without leaving traces and quickly recovery disk to a clean state after the
analysis is completed. The code of these processes is listed below.
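A minimal sketch of this snapshot/injection workflow, assuming Xen's xl toolstack, an LVM-backed guest disk, and guestfish; all volume, domain, and path names below are illustrative rather than those of the actual deployment.

import subprocess

def run(cmd):
    # Run a shell command and fail loudly on error.
    print("+", cmd)
    subprocess.run(cmd, shell=True, check=True)

def snapshot(domain="win7-sandbox", vg="vg0", lv="win7-disk",
             snap="win7-snap", state="/var/lib/xen/win7.mem"):
    # Copy-on-write snapshot of the guest disk via LVM.
    run(f"lvcreate -s -L 1G -n {snap} /dev/{vg}/{lv}")
    # Save the guest memory state with Xen's toolstack.
    run(f"xl save {domain} {state}")

def inject_sample(disk="/dev/vg0/win7-snap",
                  sample="/samples/sample.exe", dest="/Users/test"):
    # Copy the sample into the offline guest disk with guestfish,
    # avoiding any network channel between host and guest.
    run(f"guestfish --rw -a {disk} -i copy-in {sample} {dest}")

def restore(state="/var/lib/xen/win7.mem"):
    # Restore the saved memory; the snapshot volume is later removed
    # with lvremove to return the disk to a clean state.
    run(f"xl restore {state}")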
DRAKVUF uses techniques such as CoW disks and CoW memory to reduce resource cost. Besides, it uses VLANs to separate the cloned VMs. For these reasons, DRAKVUF is well suited to analyzing malware samples. The main structure of DRAKVUF is shown in Fig. 2.
However, the CPU cost of DRAKVUF reached as much as 100 % in our test. The reason is that DRAKVUF traps all kernel functions and intercepts all heap allocations and file operations. Besides, DRAKVUF only traps kernel functions; even though kernel functions are important and can be used to detect rootkits, they do not represent the general behavior of malware. To reduce the CPU cost and make the monitoring practical, we added a filter mechanism so that only the function calls of the specified process and module, including user-level calls, are trapped.
After this modification, we can obtain the user-level function call information of the given malware process, which is further processed in the next stage of function call analysis. For details of the analysis process, please refer to our former works [21, 22].
4 Evaluation
To test the effectiveness of our system, we designed several experiments. The hardware for our experiments is a Dell PowerEdge R730xd server with an Intel Xeon E5-2603 v3 CPU, a 2 TB disk, and 32 GB of memory. The virtualization environment is Xen 4.4.1 with the network configured by Open vSwitch.
The first experiment evaluates the snapshot method for Xen. As shown in Fig. 3, creating a 20 GB LVM disk takes less than 1 s, which can be ignored. The memory saving operation costs 7 s, 10 s, and 14 s for memory sizes of 1 GB, 2 GB, and 4 GB, respectively. The major cost is injecting samples, because the injection needs to mount the guest disk; however, this time increases slowly with the size of the disk, as shown in Fig. 3. The times for memory saving and recovery increase linearly with the size of memory, but both are less than 15 s. Cloning a VM mainly consists of LVM disk creation, which costs less than one second and can be ignored. Overall, the total sample preparation process costs less than one minute, which is acceptable in practice, especially when running multiple cloned VMs concurrently.
Fig. 3. Time cost of sample preparation under different memory and disk sizes
The second experiment compares SPEMS with DRAKVUF. The CPU cost of SPEMS drops dramatically compared with DRAKVUF during the test, from nearly 100 % to about 10 %. This difference causes the programs' execution times to differ greatly, as shown in Table 2. While DRAKVUF runs, it is almost impossible to perform other operations on the VM, and execution is as slow as a snail's pace, which is abnormal and would lead to the disclosure of the monitoring's existence. SPEMS is on average 13 times faster than DRAKVUF. We attribute this improvement to the use of the filter mechanism.
This improvement, combined with the snapshot and clone mechanisms, makes our system more stealthy and practically usable. Even though SPEMS dramatically reduces the time cost of DRAKVUF, there is still a time gap between native execution and VMI-based execution. This gap is hard to eliminate, because trapping and handling breakpoints takes time. However, as long as we can keep user programs executing normally, it is possible to use active VMI as shown in [23] to modify the time values returned by the relevant function calls and thus deceive user-level applications.
Besides, the top 5 function calls extracted for different applications during 5 min, compared with the original DRAKVUF, are shown in Table 3. We selected three typical applications to represent file, network, and registry operations, respectively. The file operation is simulated by uncompressing a collection of 700 malware samples. The network operation is simulated by accessing one of the most famous web portals (www.msn.cn). The registry operation is simulated by RegBak (a well-known registry backup and restore tool). From Table 3 we can see that the top 5 trapped functions of SPEMS are completely different from those of DRAKVUF, because we added many user-level function calls while DRAKVUF only traps kernel-level functions. Even though we filtered out many less sensitive kernel functions, the important information becomes clearer rather than being lost. Take the WINZIP program as an example: DRAKVUF traces large numbers of kernel functions such as ExAllocatePoolWithTag and NtCallbackReturn, which do not represent the general behavior of uncompressing, whereas SPEMS successfully traced the function calls that load the libraries related to uncompressing.
Table 3. Top 5 function calls (the gray rows are the results of DRAKVUF while the white ones are the results of SPEMS; the figures after the function names are the numbers of calls of each function; functions in bold represent the behavior most related to the program)
5 Conclusions
Dynamic analysis of malware behavior has been developed for decades, and the VMI technique has also been around for over a decade. Combining the two to accomplish stealthy and automatic malware analysis has long been a pursuit of the security research field. However, due to high performance cost and limited application areas, past works are not really practical. To improve on this, we inspected the whole process of malware analysis, including virtual environment preparation, execution tracing, and result analysis. Our main contribution is an integrated VMI-based malware execution monitoring system.
References
1. Willems, G., Holz, T., Freiling, F.: Toward automated dynamic malware analysis using
CWSandbox. IEEE Secur. Priv. 5, 32–39 (2007)
2. Bayer, U., Moser, A., Kruegel, C., Kirda, E.: Dynamic analysis of malicious code.
J. Comput. Virol. 2, 67–77 (2006)
3. Cuckoobox. http://www.cuckoosandbox.org/
4. Garfinkel, T., Rosenblum, M.: A virtual machine introspection based architecture for
intrusion detection. In: NDSS, pp. 191–206 (2003)
5. Jiang, X., Wang, X., Xu, D.: Stealthy malware detection through vmm-based out-of-the-box
semantic view reconstruction. In: Proceedings of the 14th ACM Conference on Computer
and Communications Security, pp. 128–138 (2007)
6. Srivastava, A., Giffin, J.T.: Tamper-resistant, application-aware blocking of malicious
network connections. In: Lippmann, R., Kirda, E., Trachtenberg, A. (eds.) RAID 2008.
LNCS, vol. 5230, pp. 39–58. Springer, Heidelberg (2008)
7. Nance, K., Bishop, M., Hay, B.: Investigating the implications of virtual machine
introspection for digital forensics. In: ARES 2009 International Conference on Availability,
Reliability and Security, 2009, pp. 1024–1029 (2009)
8. Payne, B.D.: Simplifying virtual machine introspection using libvmi. Sandia report (2012)
9. Lengyel, T.K., Neumann, J., Maresca, S., Payne, B.D., Kiayias, A.: Virtual machine
introspection in a hybrid honeypot architecture. In: CSET (2012)
10. Hizver, J., Chiueh, T.C.: Real-time deep virtual machine introspection and its applications.
In: Proceedings of the 10th ACM SIGPLAN/SIGOPS International Conference on Virtual
Execution Environments, VEE 2014, pp. 3–14. ACM, New York (2014)
11. Riley, R., Jiang, X., Xu, D.: Guest-transparent prevention of kernel rootkits with
VMM-based memory shadowing. In: Lippmann, R., Kirda, E., Trachtenberg, A. (eds.)
RAID 2008. LNCS, vol. 5230, pp. 1–20. Springer, Heidelberg (2008)
12. Seshadri, A., Luk, M., Qu, N., Perrig, A.: SecVisor: a tiny hypervisor to provide lifetime
kernel code integrity for commodity OSes. In: Proceedings of Twenty-First ACM SIGOPS
Symposium on Operating Systems Principles, SOSP 2007, vol. 41, pp. 335–350. ACM,
New York (2007)
13. Payne, B.D., Carbone, M., Sharif, M., Lee, W.: Lares: an architecture for secure active
monitoring using virtualization. In: Proceedings of the 2008 IEEE Symposium on Security
and Privacy, SP 2008, pp. 233–247. IEEE Computer Society, Washington, DC (2008)
14. Deng, Z., Zhang, X., Xu, D.: Spider: stealthy binary program instrumentation and debugging
via hardware virtualization. In: Proceedings of the 29th Annual Computer Security
Applications Conference, pp. 289–298 (2013)
15. Lengyel, T.K., Maresca, S., Payne, B.D., Webster, G.D., Vogl, S., Kiayias, A.: Scalability,
fidelity and stealth in the drakvuf dynamic malware analysis system. In: Proceedings of the
30th Annual Computer Security Applications Conference, pp. 386–395 (2014)
16. Dolan-Gavitt, B., Leek, T., Zhivich, M., Giffin, J., Lee, W.: Virtuoso: narrowing the
semantic gap in virtual machine introspection. In: IEEE Symposium on Security and Privacy
(SP), pp. 297–312. IEEE, New York (2011)
17. Fu, Y., Lin, Z.: Space traveling across VM: automatically bridging the semantic gap in
virtual machine introspection via online Kernel data redirection. In: Proceedings of the 2012
IEEE Symposium on Security and Privacy, SP 2012, pp. 586–600. IEEE Computer Society,
Washington, DC (2012)
18. Fu, Y., Lin, Z.: Bridging the semantic gap in virtual machine introspection via online Kernel
data redirection. ACM Trans. Inf. Syst. Secur. 16(2) (2013)
19. VM escape. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2008-0923
20. Guestfish. http://libguestfs.org/guestfish.1.html
21. Qiao, Y., Yang, Y., He, J., Tang, C., Liu, Z.: CBM: free, automatic malware analysis
framework using API call sequences. Adv. Intell. Syst. Comput. 214, 225–236 (2014)
22. Qiao, Y., Yang, Y., Ji, L., He, J.: Analyzing malware by abstracting the frequent itemsets in
API call sequences. In: Proceedings of the 12th IEEE International Conference Trust
Security Privacy Computing and Communications (TrustCom), 2013, pp. 265–270 (2013)
23. Dinaburg, A., Royal, P., Sharif, M., Lee, W.: Ether: malware analysis via hardware
virtualization extensions. In: Proceedings of the 15th ACM Conference on Computer and
Communications Security, pp. 51–62 (2008)
24. Mei, S., Wang, Z., Cheng, Y., Ren, J., Wu, J., Zhou, J.: Trusted bytecode virtual machine
module: a novel method for dynamic remote attestation in cloud computing. Int. J. Comput.
Intell. Syst. 5, 924–932 (2012)
25. Shuang, T., Lin, T., Xiaoling, L., Yan, J.: An efficient method for checking the integrity of
data in the cloud. China Commun. 11, 68–81 (2014)
A Total Power Control Cooperative MAC
Protocol for Wireless Sensor Networks
1 Introduction
Cooperative communication obtains spatial diversity gain by exploiting distributed antennas to form a virtual MIMO system, which overcomes difficulties such as implementing more than one antenna on a small-sized node. Most of the earlier cooperative communication works mainly focus on physical layer aspects [1, 2], while recent works examine cooperative technology at the MAC layer [3–11] and the network layer [12–14]. In recent years, interest in cross-layer cooperation has increased because of its significant influence on both the theory and the practice of networks.
The MAC layer plays an essential role in data transmission under complex constraints: it controls multi-user access to the shared wireless channel and provides relatively transparent data transmission services to the upper layers. CoopMAC [3], proposed by Pei Liu et al., is an early cooperative MAC. Every node in CoopMAC maintains a cooperative node list recording each neighbor's rate and the time of its last packet. When the source node needs to transmit a packet, it chooses the best relay node according to the minimal end-to-end transmission time. CoopMAC improves reliability and throughput but costs more memory. The opportunistic cooperative MAC protocol proposed in [4] is based on cross-layer technology and is designed to improve throughput by selecting the relay node with the best channel. References [5, 6] select the best relay through inter-group and intra-group competitions; the grouping and competition method improves throughput but also costs additional energy in the corresponding control packet overhead. To maintain fairness and reduce the loss rate of control frames, [7] proposed RCMAC, which transmits clear-to-send (CTS) and acknowledgment (ACK) packets through the relay with the fastest transmission rate. Based on 802.11, ECCMAC [8] improves throughput by choosing the relay with the shortest transmission time and by using network coding. To improve throughput further, STiCMAC [9] employs multiple relays combined with distributed space-time coding.
As mentioned above, most of the proposed cooperative MACs are designed to improve throughput, while relatively few works focus on reducing energy consumption. For energy-constrained wireless networks such as WSNs, however, energy conservation is particularly important. Motivated by this, we propose the total power control MAC (TPC-MAC) protocol. The major contributions of this paper can be summarized as follows:
– A distributed cooperative MAC protocol is proposed, which introduces spatial diversity to mitigate multi-path fading.
– A relay selection strategy is discussed, which leads to the minimal total network transmit power and prolongs the lifetime of the network.
The rest of this paper is structured as follows. In Sect. 2, system models for a cooperative network and a non-cooperative network are presented, and the theoretical total transmission power for these two network models is analyzed. Section 3 presents the details of the proposed protocol. Section 4 shows the numerical results. Finally, Sect. 5 concludes the paper.
The following paragraphs analyze the total transmit power of the non-cooperative model and of the cooperative model with decode-and-forward (DF) relaying.
where P_s^Noncoop is the transmit power of S and n_sd is the AWGN with zero mean and variance N_0, i.e. CN(0, N_0). The channel variance σ_sd² of h_sd is modeled as:
\sigma_{sd}^2 = \eta D_{sd}^{-\alpha}    (2)
The average BER with M-PSK modulation can be approximated as:
BER \approx \frac{A N_0}{\beta\, P_s^{Noncoop}\, \sigma_{sd}^2 \log_2 M}    (3)
where β = sin²(π/M) and A = (M−1)/2M + sin(2π/M)/4π. Substituting (2) into (3) and imposing the performance requirement BER ≤ ε, where ε denotes the maximum allowable BER, the minimal transmit power of S is given by:
P_s^{Noncoop} = \frac{A N_0 D_{sd}^{\alpha}}{\beta \varepsilon \eta \log_2 M}    (4)
where P_Scoop is the transmit power of S in the cooperative model, and h_sd, h_sr, n_sr and n_sd are modeled as CN(0, σ_sd²), CN(0, σ_sr²), CN(0, N_0) and CN(0, N_0), respectively. During transmission phase 2, if the relay node R correctly decodes the signal received from the source node S, it helps forward the signal; if R decodes the signal incorrectly, it discards the signal. The signal received at the destination node D can be expressed as:
y_{rd} = \sqrt{P_R}\, h_{rd}\, x(n) + n_{rd}    (7)
where P_R denotes the transmit power at the relay node R, h_rd is modeled as CN(0, σ_rd²) and n_rd is modeled as CN(0, N_0). After transmission phases 1 and 2, the destination node D combines the signals y_sd and y_rd using maximum ratio combining (MRC).
So, when the relay node participates in the communication between the source node and the destination node, the total transmit power can be expressed as:
If all channel links are available, then, considering the cases of incorrect and correct decoding at the relay node R and averaging the conditional BER over the Rayleigh-distributed random variables, the upper bound of the average BER with M-PSK modulation can be expressed as [15]:
BER \le \frac{A^2 N_0^2}{\beta^2 P_{Scoop}^2 \sigma_{sd}^2 \sigma_{sr}^2 \log_2 M} + \frac{B^2 N_0^2}{\beta^2 P_{Scoop} P_R \sigma_{sd}^2 \sigma_{rd}^2 \log_2 M}    (9)
\varepsilon = \frac{A^2 N_0^2}{\beta^2 P_{Scoop}^2 \sigma_{sd}^2 \sigma_{sr}^2 \log_2 M} + \frac{B^2 N_0^2}{\beta^2 P_{Scoop} P_R \sigma_{sd}^2 \sigma_{rd}^2 \log_2 M}    (10)
Setting the derivative to zero, the values of P_Scoop and P_R leading to the minimum P_Sum can be expressed as:
P_{Scoop} = \sqrt{E / 2C}    (12)
P_R = \sqrt{2E / C}\,/\,(E - 2D)    (13)
where C = \beta^2 \varepsilon \sigma_{sd}^2 \sigma_{sr}^2 \log_2 M / (B N_0^2), D = A^2 \sigma_{rd}^2 / (B \sigma_{sr}^2), and E = (2D + 1 + \sqrt{8D + 1})/2. When BPSK is applied, A = 1/4, B = 3/16, and β = 1.
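As a numerical illustration of the power model, the sketch below evaluates the reconstructed expressions (4), (12) and (13) and applies the relay test P_Sum < P_s^Noncoop (taking P_Sum = P_Scoop + P_R); all parameter values are illustrative, not taken from the paper.

import math

def p_noncoop(A, beta, N0, eps, eta, alpha, D_sd, M=2):
    # Eq. (4): minimal direct-transmission power of the source.
    return A * N0 * D_sd**alpha / (beta * eps * eta * math.log2(M))

def p_coop(A, B, beta, N0, eps, var_sd, var_sr, var_rd, M=2):
    # Eqs. (12)-(13): source and relay powers minimizing the total power.
    C = beta**2 * eps * var_sd * var_sr * math.log2(M) / (B * N0**2)
    D = A**2 * var_rd / (B * var_sr)
    E = (2 * D + 1 + math.sqrt(8 * D + 1)) / 2
    P_s = math.sqrt(E / (2 * C))
    P_r = math.sqrt(2 * E / C) / (E - 2 * D)
    return P_s, P_r

# BPSK constants from the text: A = 1/4, B = 3/16, beta = 1.
A, B, beta, M = 0.25, 3.0 / 16.0, 1.0, 2
N0, eps, eta, alpha = 1e-9, 1e-3, 1.0, 3      # illustrative values
var_sd, var_sr, var_rd = 1e-4, 4e-4, 4e-4     # illustrative channel variances

P_direct = p_noncoop(A, beta, N0, eps, eta, alpha, D_sd=30, M=M)
P_s, P_r = p_coop(A, B, beta, N0, eps, var_sd, var_sr, var_rd, M=M)
P_sum = P_s + P_r   # assumed total power P_Sum = P_Scoop + P_R
print("cooperate" if P_sum < P_direct else "transmit directly")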
So, if we select a relay that leads to P_Sum < P_s^Noncoop, the transmit power can be reduced, but an unsuitable relay results in more energy consumption than direct transmission. To show this, we ran a simulation comparing the values of P_Sum and P_s^Noncoop, because a direct comparison of the formulas is very complicated. Figure 2 shows the results. The distance between S and D is 30 m, and the parameters are listed in Table 1. In the figure, the two larger circles define the transmission ranges of S and D according to the maximum transmit power constraint P_max. All nodes in the intersection of the two circles can hear both S and D and have the capacity to act as a relay, but only those nodes located inside the dotted circle, such as R1, lead to P_Sum < P_s^Noncoop. Nodes outside the dotted circle, such as R2, lead to P_Sum > P_s^Noncoop. When the control packets are taken into account, the suitable relay range becomes smaller.
Fig. 2. Positions of S, D and relay candidates R1, R2 in the X/Y plane
This section presents an overview of 802.11, discusses the basic idea of the TPC-MAC protocol, and introduces the details of the protocol.
The best relay is the candidate that minimizes the total transmit power of the network:
k = \arg\min_{i = 1,\dots,N} P_{sum}^{i}
where k is the best relay, i denotes a candidate relay node, P_sum^k and P_sum^i denote the total transmit power of the network with relay node k and i respectively, and N is the number of candidate relay nodes.
The relay candidate node overhears the RTS packet and the CTS packet. σ_sd² can be calculated from the non-cooperative transmit power of the source node, P_s^Noncoop, extracted from the CTS packet. Then, using formulas (12) and (13), the relay candidate node calculates the cooperative transmit powers P_Scoop, P_R and P_Sum. If P_s^Noncoop is larger than P_Sum, the relay candidate node sets its countdown timer according to the value of P_Sum. When the timer reaches zero and the channel is idle, it transmits an HTS packet carrying P_Scoop. If P_s^Noncoop is not larger than P_Sum, or an HTS packet has already been overheard, the relay candidate node keeps silent. Thus, the best relay node, which leads to the minimum total transmit power of the network, is selected.
Relay Node. After successfully transmitting the HTS packet, the relay node waits for the data packet transmitted by the source node. If the received data packet can be decoded correctly, the relay forwards it to the destination node using the calculated transmit power P_R.
The exchange of control packets in TPC-MAC and the corresponding NAV settings are shown in Fig. 4. To notify the transmit power, we extend the CTS packet format and add a 2-byte power field for P_s^Noncoop. The HTS packet format is the same as that of the CTS packet, and its 2-byte power field is used for P_Scoop. Figure 5 shows the formats of the CTS packet and the HTS packet.
4 Simulation Results
Performance under the BER constraint with BPSK modulation is compared for different numbers of sensors in Fig. 6. We note that the lifetime improves as the number of sensor nodes increases, because the total network energy increases.
MMRE performs the worst, even though it uses both residual energy information (REI) and CSI; this is because it lacks a cooperative communication mechanism. WcoopMAC, which employs a cooperative communication scheme and picks the node with the maximum residual energy as the relay, performs much better than MMRE. TPC-MAC performs best, because the relay selection strategy presented in Sect. 3.2 further reduces the energy consumption of the whole network.
Fig. 6. Network lifetime versus number of sensors for TPC-MAC, WcoopMAC and MMRE
Fig. 7. Network lifetime versus average BER for TPC-MAC, WcoopMAC and MMRE
Figure 7 considers the network lifetime under different BER constraints. The number of sensor nodes is 100. In order to examine the performance under high BER requirements, the maximum transmission power limit is not considered. According to formulas (4) and (10), when the BER requirement becomes stricter (a smaller allowed BER value), the transmit power becomes larger. Accordingly, Fig. 7 shows that the network lifetimes of the three compared MAC protocols all increase as the allowed BER value increases. The lifetime of TPC-MAC remains about twice that of WcoopMAC when the BER value is smaller than 10^-5. When the BER requirement is loose, the transmission energy saved by power optimization in WcoopMAC cannot make up for the extra energy consumed by the control packets; thus MMRE performs better than WcoopMAC in that regime. In TPC-MAC, the candidate relay node compares the total transmission power of the cooperative and non-cooperative schemes, and if the cooperative scheme needs more power than the non-cooperative one, the candidate relay keeps silent. Therefore the lifetime of TPC-MAC in Fig. 7 is longer than those of WcoopMAC and MMRE.
Fig. 8. Average wasted energy versus average BER for TPC-MAC, WcoopMAC and MMRE
Fig. 9. Average wasted energy versus number of sensors for TPC-MAC, WcoopMAC and MMRE
We also compared the average wasted energy under different BER values with 100 sensor nodes in the network. Figure 8 shows the total residual network energy when the network dies. When the BER value is smaller than 10^-4, TPC-MAC is much more efficient than WcoopMAC and MMRE in terms of energy savings.
Figure 9 considers the average wasted energy of the network for different numbers of sensor nodes under a BER constraint of 10^-3. The average wasted energy of WcoopMAC and MMRE grows almost linearly as the number of sensor nodes increases, whereas the change in the number of sensor nodes has little impact on TPC-MAC. This is because TPC-MAC adaptively chooses between the cooperative and non-cooperative communication schemes and picks the best relay node, which leads to the minimal total transmit power.
5 Conclusion
Acknowledgement. This work was supported by Innovation Project of SCI&Tec for College
Graduates of Jiangsu Province (CXZZ12_0475) and Innovation Project of Nanjing Institute of
Technology (QKJA201304).
References
1. Laneman, J.N., Tse, D.N.C., Wornell, G.W.: Cooperative diversity in wireless networks:
efficient protocols and outage behavior. IEEE Trans. Inf. Theor. 50(12), 3062–3080 (2004)
2. Hunter, T.E., Nosratinia, A.: Diversity through coded cooperation. IEEE Trans. Wirel.
Commun. 5(2), 283–289 (2006)
3. Liu, P., Tao, Z., Narayanan, S., Korakis, T., Panwar, S.S.: Coopmac: a cooperative mac for
wireless LANs. IEEE J. Sel. Areas Commun. 25(2), 340–354 (2007)
4. Yuan, Y., Zheng, B., Lin, W., Dai, C.: An opportunistic cooperative MAC protocol based on
cross-layer design. In: International Symposium on Intelligent Signal Processing and
Communication Systems, ISPACS 2007, pp. 714–717. IEEE (2007)
5. Zhou, Y., Liu, J., Zheng, L., Zhai, C., Chen, H.: Link-utility-based cooperative MAC
protocol for wireless multi-hop networks. IEEE Trans. Wirel. Commun. 10(3), 995–1005
(2011)
6. Shan, H., Cheng, H.T., Zhuang, W.: Cross-layer cooperative mac protocol in distributed
wireless networks. IEEE Trans. Wirel. Commun. 10(8), 2603–2615 (2011)
7. Kim, D.W., Lim, W.S., Suh, Y.J.: A robust and cooperative MAC protocol for IEEE
802.11a wireless networks. Wirel. Pers. Commun. 67(3), 689–705 (2012)
8. An, D., Woo, H., Yoo, H., et al.: Enhanced cooperative communication MAC for mobile
wireless networks. Comput. Netw. 57(1), 99–116 (2013)
9. Liu, P., Nie, C., Korakis, T., Erkip, E., Panwar, S., Verde, F., et al.: STicMAC: a MAC
protocol for robust space-time coding in cooperative wireless LANs. IEEE Trans. Wirel.
Commun. 11(4), 1358–1369 (2011)
10. Himsoon, T., Siriwongpairat, W.P., Han, Z., Liu, K.J.R.: Lifetime maximization via
cooperative nodes and relay deployment in wireless networks. IEEE J. Sel. Areas Commun.
25(2), 306–317 (2007)
11. Zhai, C., Liu, J., Zheng, L., Xu, H.: Lifetime maximization via a new cooperative MAC
protocol in wireless sensor networks. In: IEEE Global Telecommunications Conference,
GLOBECOM 2009, Hawaii, USA, pp. 1–6, 30 November–4 December 2009
12. Ibrahim, A., Han, Z., Liu, K.: Distributed energy-efficient cooperative routing in wireless
networks. IEEE Trans. Wirel. Commun. 7(10), 3930–3941 (2008)
13. Razzaque, M.A., Ahmed, M.H.U., Hong, C.S., Lee, S.: QoS-aware distributed adaptive
cooperative routing in wireless sensor networks. Ad Hoc Netw. 19(8), 28–42 (2014)
14. Chen, S., Li, Y., Huang, M., Zhu, Y., Wang, Y.: Energy-balanced cooperative routing in
multihop wireless networks. Wirel. Netw. 19(6), 1087–1099 (2013)
15. Su, W., Sadek, A.K., Liu, K.J.R.: SER performance analysis and optimum power allocation
for decode-and-forward cooperation protocol in wireless networks. In: Proceedings of IEEE
Wireless Communications and Networking Conference, WCNC 2005, New Orleans, LA,
vol. 2, pp. 984–989 (2005)
16. Chen, Y., Zhao, Q.: Maximizing the lifetime of sensor network using local information on
channel state and residual energy. In: Proceedings of the Conference on Information Science
and Systems, CISS 2005. The Johns Hopkins University, March 2005
Parallel Processing of SAR Imaging
Algorithms for Large Areas Using Multi-GPU
1 Introduction
Synthetic Aperture Radar (SAR) is an active microwave imaging radar that can generate images similar to optical imaging under low-visibility conditions. Due to its all-day, all-weather capability, the SAR technique is currently a hot topic in photogrammetry and remote sensing, with huge potential in military, agricultural, forestry, ocean, and other fields [1–3]. However, a high-resolution SAR system is computationally heavy, since it receives a huge amount of data and needs complex computation to generate the final images. Some applications also have strict real-time requirements, which places even higher demands on the SAR system.
At present, the post-processing of SAR images uses workstations or large servers based on Central Processing Units (CPUs) and is rather time-consuming. Many real-time SAR systems are designed with special DSPs and FPGAs [4–7]; all of these need complex programming and expensive hardware devices. At the same time, the increasing resolution requirements of SAR systems cause rapid growth in computation time.
The Graphics Processing Unit (GPU), once a highly specialized device designed exclusively for manipulating image data, has grown into a powerful general-purpose parallel processor.
2 Background
Raw SAR data can be regarded as a complex matrix in which the rows represent the range direction and the columns represent the azimuth direction. The principal operations in the range direction are vector multiplication, the Fast Fourier Transform (FFT)/Inverse Fast Fourier Transform (IFFT), and interpolation. Azimuth-direction operations are primarily FFT/IFFT, and a normalization operation is needed after each IFFT. The operations in both the range and azimuth directions are highly parallel: the data in each direction can be divided into small blocks at the row or column level (Fig. 1), and these loosely coupled blocks can be processed in parallel on the GPU.
Fig. 1. Division of the Nr × Na data matrix into blocks along the range and azimuth directions
In the design of this paper, we assume that the raw SAR data are always stored in the range direction in host memory and that the final images are also stored in the range direction. Owing to the conversion between the range and azimuth directions during imaging, multiple transpose operations (four in this paper) are needed. When processing in the range direction, cudaMemcpy()/cudaMemcpyAsync() is called to transfer data between the CPU and GPU; cudaMemcpy2D()/cudaMemcpy2DAsync() is called in the corresponding azimuth case.
Figure 2 shows the procedures of the three algorithms. They share roughly the same structure, so the parallelization method is universal when deployed on the GPU.
In order to facilitate concurrent execution between host and device, some function calls are asynchronous: control is returned to the host thread before the device has completed the requested task. CUDA applications manage concurrency through streams. A stream is a sequence of commands that execute in order; different streams may execute their commands out of order with respect to one another or concurrently. CUDA streams and asynchronous concurrent execution are used cooperatively. Devices of compute capability 1.1 [9] or higher can perform copies between page-locked host memory and device memory concurrently with kernel execution. Benefiting from these two features, the data transfer time and kernel execution time are successfully overlapped in our implementation.
Fig. 2. Processing flow of the three algorithms (range FFT/IFFT, reference function multiplication, chirp scaling, RCMC, Stolt interpolation and phase correction)
The data assigned to each GPU is further divided into smaller blocks along the range/azimuth direction, which we call cell-data here. Multiple CUDA streams (three in this paper) are created to process the cell-data. Combined with the asynchronous concurrency technique, the transmission time between CPU and GPU is fully overlapped; the effect is shown in Sect. 4 (Figs. 5 and 6). Figure 3 shows the cell-data processing on each stream.
c_r = \frac{M_c}{16\,N_a}, \qquad c_a = \frac{M_c}{16\,N_r}
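The following pure-Python sketch (no CUDA calls) illustrates how the cell-data are split and assigned to the three streams in a round-robin fashion, as described above; the matrix size and block count are illustrative.

import numpy as np

NUM_STREAMS = 3  # number of CUDA streams used in the paper

def split_cells(data, n_cells):
    # Split the (range x azimuth) matrix into cell-data blocks along rows.
    return np.array_split(data, n_cells, axis=0)

def assign_streams(cells):
    # cell_j is processed by stream l = j % 3, so the copy of one cell
    # can overlap with the kernel execution of another.
    schedule = {s: [] for s in range(NUM_STREAMS)}
    for j, cell in enumerate(cells):
        schedule[j % NUM_STREAMS].append((j, cell.shape))
    return schedule

raw = np.zeros((1024, 1024), dtype=np.complex64)   # illustrative size
for stream, work in assign_streams(split_cells(raw, 12)).items():
    print("stream", stream, "->", [j for j, _ in work])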
(Figure: workflow of the four host threads, each issuing phase-function multiplication, azimuth FFT/IFFT, transposition and asynchronous copies (cudaMemcpyAsync/cudaMemcpy2DAsync) on its streams)
(1) Divide the raw SAR data into 4 pieces in the azimuth direction. Each piece is labelled subData_Nai (i represents the GPU number), and the starting address of subData_Nai is sent to GPU_i.
(2) subData_Nai is further divided into cell-data in thread_i. Each cell-data block is labeled cell_j (j = 0, 1, 2, 3, ... according to the order of storage in memory). cudaMemcpy2DAsync() is called to transmit the cell-data to GPU_i over 3 streams; cell_j belongs to stream_l (l = j % 3). Then the azimuth FFT and transposition are performed on GPU_i.
This paper implements the RDA, CSA and ωKA on a CPU, a single GPU, 2 GPUs and 4 GPUs, respectively. All implementations adopt double-precision floating point and achieve the same processing accuracy. The sizes of the simulated SAR data are 1 GB (8192 × 8192), 2 GB (16384 × 8192), 4 GB (32768 × 8192) and 8 GB (32768 × 16384). Table 1 shows the parameters of the experiments.
With three streams, the kernel computation time and data transmission time are completely overlapped, except for the transmission of the first cell-data block and the last one. Figures 5 and 6 are typical examples from the execution of the three algorithms. Figure 5 represents the case in which the transmission time is greater than the kernel computation time, and Fig. 6 shows the opposite case. Steps 1, 2, 3 and 4 (Fig. 2) of the RDA belong to the case of Fig. 5, as do steps 1, 2 and 4 of the CSA and ωKA; step 3 of the CSA and ωKA belongs to the case of Fig. 6. Steps 1, 2 and 4 contain only FFT/IFFT and complex multiplication, so their execution is very short, whereas the Stolt interpolation in the ωKA and the three phase multiplications in the CSA take longer than the transmission.
parallel efficiency = S / N
As the data size increases, the parallel efficiency increases gently. The CSA and ωKA can reach 1.96, while the RDA can only reach 1.75 on 2 GPUs; on 4 GPUs, the CSA and ωKA reach 3.84 and the RDA reaches 3.13. The reason is that the execution time of the RDA is very short and the average synchronization time between GPUs accounts for 8 % according to our statistics, while the effect of synchronization time on the CSA and ωKA is very slight, only about 1 %. The parallel method could therefore be extended to more GPUs without decreasing parallel efficiency.
The RDA, CSA and ωKA are all reasonable approaches for precision processing of SAR data. The CSA is more complex and takes longer to implement but promises better resolution in some extreme cases; it is more phase-preserving and avoids the computationally extensive and complicated interpolation used by the RDA. The ωKA is the most complex and time-consuming algorithm due to its Stolt interpolation; nevertheless, it can offer the best-quality SAR images in most cases. A detailed comparison of the algorithms implemented on multi-GPU is shown in Table 4 (the data size is 8 GB). Researchers can select the most suitable algorithm according to our conclusions.
We also put forward some ideas for future work on SAR imaging on the GPU platform. Our experimental results have demonstrated that the real-time requirements of SAR imaging can be satisfied when using our multi-GPU method; that is to say, the amount of computation is no longer a problem thanks to the excellent computational power of GPUs. We could therefore design SAR imaging algorithms from another perspective: for example, to improve the resolution of SAR images, the number of interpolation points in the ωKA can be increased, and a precision implementation of the RDA combined with RCMC interpolation on the GPU is no longer a difficult problem.
References
1. Soumekh, M.: Moving target detection in foliage using along track monopulse synthetic
aperture radar imaging. IEEE Trans. Image Process. 6(8), 1148–1163 (1997)
2. Koskinen, J.T., Pulliainen, J.T., Hallikainen, M.T.: The use of ERS-1 SAR data in snow melt
monitoring. IEEE Trans. Geosci. Remote Sens. 35(3), 60–610 (1997)
3. Sharma, R., Kumar, S.B., Desai, N.M., Gujraty, V.R.: SAR for disaster management. IEEE
Aerosp. Electron. Syst. Mag. 23(6), 4–9 (2008)
4. Liang, C., Teng, L.: Spaceborne SAR real-time quick-look system. Trans. Beijing Inst.
Technol. 6, 017 (2008)
5. Tang, Y.S., Zhang, C.Y.: Multi-DSPs and SAR real-time signal processing system based on
cPCI bus. In: 2007 1st Asia and Pacific Conference on Synthetic Aperture Radar, pp. 661–
663. IEEE (2007)
6. Xiong, J.J., Wang, Z.S., Yao, J.P.: The FPGA design of on board SAR real time imaging
processor. Chin. J. Electron. 33(6), 1070–1072 (2005)
7. Marchese, L., Doucet, M., Harnisch, B., Suess, M., Bourqui, P., Legros, M., Bergeron, A.:
Real-time optical processor prototype for remote SAR applications. In: Proceedings of
SPIE7477, Image and Signal Processing for Remote Sensing XV, pp. 74771H–74771H
(2009)
8. Cumming, I.G., Wong, F.H.: Digital Processing of Synthetic Aperture Radar Data:
Algorithms and Implementation. Artech House, Norwood (2005)
9. Zhang, S., Chu, Y.L.: GPU High Performance Computing: CUDA. Waterpower Press,
Bejing (2009)
10. Meng, D.D., Hu, Y.X., Shi, T., Sun, R.: Airborne SAR real-time imaging algorithm design
and implementation with CUDA on NVIDIA GPU. J. Radars 2(4), 481–491 (2013)
11. Wu, Y.W., Chen, J., Zhang, H.Q.: A real-time SAR imaging system based on CPU-GPU
heterogeneous platform. In: 11th International Conference on Signal Processing, pp. 461–
464. IEEE (2012)
12. Malanowski, M., Krawczyk, G., Samczynski, P., Kulpa, K., Borowiec, K., Gromek, D.:
Real-time high-resolution SAR processor using CUDA technology. In: 2013 14th
International Radar Symposium (IRS), pp. 673–678. IEEE (2013)
13. Bhaumik Pandya, D., Gajjar, N.: Parallelization of synthetic aperture radar (SAR) imaging
algorithms on GPU. Int. J. Comput. Sci. Commun. (IJCSC) 5, 143–146 (2014)
14. Song, M.C., Liu, Y.B., Zhao, F.J., Wang, R., Li, H. Y.: Processing of SAR data based on the
heterogeneous architecture of GPU and CPU. In: IET International Radar Conference 2013,
pp. 1–5. IET (2013)
15. Ning, X., Yeh C., Zhou, B., Gao, W., Yang, J.: Multiple-GPU accelerated range-doppler
algorithm for synthetic aperture radar imaging. In: 2011 IEEE Radar Conference (RADAR),
pp. 698–701. IEEE (2011)
16. Tiriticco, D., Fratarcangeli, M., Ferrara, R., Marra, S.: Near-real-time multi-GPU wk
algorithm for SAR processing. In: Proceedings of the 2014 Conference on Big Data from
Space, pp. 263–266. Publications Office of the European Union (2014)
An Extreme Learning Approach to Fast
Prediction in the Reduce Phase
of a Cloud Platform
1 Introduction
MapReduce (MR) [1] has become the most popular distributed computing model in cloud environments, where large-scale datasets can be processed transparently using map and reduce procedures on the cloud infrastructure. Two types of nodes are maintained in an MR cluster: the JobTracker and the TaskTracker nodes. The JobTracker coordinates MapReduce jobs, while the TaskTrackers run on the data nodes and execute the tasks.
MR, as well as cloud computing, has become a hotspot in academia [2], and many researchers have tried to optimize it. The proposals in [3–6] can predict the execution states of MapReduce, but they cannot do so precisely. In this paper, a novel prediction model based on the ELM algorithm is proposed to facilitate the execution of reduce operations in a cloud environment.
The remaining sections are organized as follows. Related work is given in Sect. 2, followed by Sect. 3, where our prediction approach is detailed. In Sect. 4, the testing environment and corresponding scenarios are designed for verification and evaluation. Finally, the conclusion and future work are discussed in Sect. 5.
2 Related Work
Offline or online profiling has been proposed in previous work to predict application resource requirements by using benchmarks or real application workloads. Wood et al. [3] designed a general approach to estimate the resource requirements of applications running in a virtualized environment. They profiled different types of virtualization overhead and built a model that maps the resource profile of the native system onto the virtualized system; their model focuses on relating the resource requirements of the real hardware platform to the virtual one. Islam et al. [4] studied changing workload demands met by starting new VM instances and proposed a prediction model for adaptive resource provisioning in a cloud. Complex machine learning techniques were proposed in [5] to create accurate performance models of applications; they estimated resource usage with an approach named PQR2. Jing et al. [6] presented a model that predicts the computing resource consumption of MapReduce applications based on a classification and regression tree.
Artificial neural networks (ANNs), as an effective method, have been widely applied in applications involving classification or function approximation [7]. However, the training speed of ANNs is much slower than practical applications require. To overcome this drawback, the approximation capability of feed-forward networks has been exploited, especially with limited training sets. One of the most important achievements of this line of work is a novel learning algorithm for single-hidden-layer feed-forward neural networks (SLFNs) [8], i.e. the ELM [8–13].
In Algorithm 1, the size of the training set, as well as the iteration and execution times, is collected as input parameters, and the number of hidden neurons L is generated as the output.
(a) Randomly generate the weights between the input layer and the hidden layer, i.e. the hidden-layer weights w and thresholds b;
(b) Calculate the output matrix H of the hidden layer;
(c) Work out the output-layer weights (a sketch of these steps is given below).
Step 3: Data validation. Use the data generated in Step 1 to validate the NO-ELM prediction model: with the parameters trained in Step 2, compute the predicted values on the test set and compare them with the actual values to verify the prediction performance of the model.
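To make steps (a)-(c) concrete, the sketch below implements a basic single-hidden-layer ELM regressor in NumPy with a sigmoid activation and a pseudo-inverse solution, following the standard ELM formulation [8]; the data and parameter values are illustrative, not the authors'.

import numpy as np

class ELMRegressor:
    def __init__(self, n_hidden, rng=None):
        self.L = n_hidden                      # number of hidden neurons
        self.rng = rng or np.random.default_rng(0)

    def _hidden(self, X):
        # Output matrix H of the hidden layer with a sigmoid activation.
        return 1.0 / (1.0 + np.exp(-(X @ self.w + self.b)))

    def fit(self, X, y):
        n_features = X.shape[1]
        # (a) randomly generate input weights w and thresholds b
        self.w = self.rng.standard_normal((n_features, self.L))
        self.b = self.rng.standard_normal(self.L)
        # (b) calculate the hidden-layer output matrix H
        H = self._hidden(X)
        # (c) work out the output-layer weights with the pseudo-inverse
        self.beta = np.linalg.pinv(H) @ y
        return self

    def predict(self, X):
        return self._hidden(X) @ self.beta

# Illustrative use: predict execution time from (reducer_no, input_data_vol).
X = np.array([[2, 1.0], [4, 1.0], [8, 2.0], [16, 2.0]], dtype=float)
y = np.array([120.0, 70.0, 90.0, 55.0])       # made-up execution times (s)
model = ELMRegressor(n_hidden=10).fit(X, y)
print(model.predict(np.array([[8, 1.0]])))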
For the model that predicts the number of reducers, the data format is set as {reducer_no, execution_time, input_data_vol}. By default, the prediction model recommends the number of reducers that can complete the task as soon as possible, so the input format can be simplified to {reducer_no, input_data_vol}. If the completion time of a task needs to be specified, the prediction model recommends the corresponding number of reducers; in that case the input format is {execution_time, input_data_vol}.
In order to test the performance of the new prediction model, a practical Hadoop environment was built consisting of personal computers and a server. Each personal computer has 12 GB of memory, a single 500 GB disk and a dual-core processor. The server is equipped with 288 GB of memory and a 10 TB SATA drive. Eight virtual instances were created on the server with the same specification as the personal computers, i.e. the same amount of memory and storage space and the same number of processors. In terms of role configuration, the server runs as the name node, whilst the virtual machines and personal computers run as data nodes.
A shared open dataset [6] was used as the input workload, containing 26 GB of text files; the dataset was further separated into 5 groups for testing purposes. A K-means (KM) clustering algorithm provided by the Purdue MapReduce Benchmarks Suite was used for the partitioning operation on the cloud platform.
Before training the NO-ELM prediction model, the samples are screened by the relative deviation of each sample from the series mean:
h_t = \frac{|s_t - \bar{s}|}{\bar{s}}
where \bar{s} is the mean value of the sample series and s_t is the value of one sample. We remove s_t from the samples if h_t is greater than 5 % and s_t is greater than \bar{s}, considering that such samples may be affected by network congestion.
The sample data are then normalized as follows:
s_t' = \frac{s_t - s_{min}}{s_{max} - s_{min}}
where s_min is the minimum value of the sample series and s_max is the maximum value. After normalization, the sample data lie in the range [0, 1].
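A small sketch of this preprocessing, under the outlier rule and min-max normalization reconstructed above (the 5 % threshold is the one stated in the text):

import numpy as np

def preprocess(samples, threshold=0.05):
    s = np.asarray(samples, dtype=float)
    mean = s.mean()
    # Drop samples that deviate from the mean by more than 5 % and lie
    # above it, which are assumed to be affected by network congestion.
    ratio = np.abs(s - mean) / mean
    kept = s[~((ratio > threshold) & (s > mean))]
    # Min-max normalization to the range [0, 1].
    return (kept - kept.min()) / (kept.max() - kept.min())

print(preprocess([100, 102, 98, 101, 160, 99]))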
5 Conclusion
In this paper, an extreme learning machine with the number of hidden neurons optimized (NO-ELM) has been introduced to analyze and predict the data. The NO-ELM method has been implemented in a real Hadoop environment, where the SVM algorithm has also been replicated for comparison purposes. The results show that NO-ELM gives better performance in predicting the execution time and the number of reducers to be used.
References
1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun.
ACM 51(1), 107–113 (2008)
2. Fu, Z., Sun, X., Liu, Q., Zhou, L., Shu, J.: Achieving efficient cloud search services:
multi-keyword ranked search over encrypted cloud data supporting parallel computing.
IEICE Trans. Commun. E98-B(1), 190–200 (2015)
3. Wood, T., Cherkasova, L., Ozonat, K., Shenoy, P.D.: Profiling and modeling resource usage
of virtualized applications. In: Issarny, V., Schantz, R. (eds.) Middleware 2008. LNCS, vol.
5346, pp. 366–387. Springer, Heidelberg (2008)
4. Islam, S., Keung, J., Lee, K., Liu, A.: Empirical prediction models for adaptive resource
provisioning in the cloud. Future Gener. Comput. Syst. 28(1), 155–162 (2012)
5. Matsunaga, A., Fortes, J.A.B.: On the use of machine learning to predict the time and
resources consumed by applications. In: Proceedings of the 2010 10th IEEE/ACM
International Conference on Cluster, Cloud and Grid Computing, pp. 495–504. IEEE
Computer Society (2010)
6. Piao, J.T., Yan, J.: Computing resource prediction for MapReduce applications using
decision tree. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu, G. (eds.) APWeb 2012. LNCS,
vol. 7235, pp. 570–577. Springer, Heidelberg (2012)
7. Oong, T.H., Isa, N.A.: Adaptive evolutionary artificial neural networks for pattern
classification. IEEE Trans. Neural Netw. 22, 1823–1836 (2011)
8. Huang, B., Zhu, Q.Y., Siew, C.K.: Extreme learning machine: theory and applications.
Neurocomputing 70, 489–501 (2006)
9. Samat, A., Du, P., Liu, S., Li, J., Cheng, L.: E2LMs: ensemble extreme learning machines
for hyperspectral image classification. IEEE J. Sel. Topics Appl. Earth Observ. Remote
Sens. 7(4), 1060–1069 (2014)
10. Bianchini, M., Scarselli, F.: On the complexity of neural network classifiers: a comparison
between shallow and deep architectures. IEEE Trans. Neural Netw. Learn. Syst. 1553–1565
(2013)
11. Wang, N., Er, M.J., Han, M.: Generalized single-hidden layer feedforward networks for
regression problems. IEEE Trans. Neural Netw. Learn. Syst. 26(6), 1161–1176 (2015)
12. Giusti, C., Itskov, V.: A no-go theorem for one-layer feedforward networks. IEEE Trans.
Neural Netw. 26(11), 2527–2540 (2014)
13. Huang, G., Zhou, H., Ding, X., Zhang, R.: Extreme learning machine for regression and
multiclass classification. IEEE Trans. Syst. Man Cybern. B Cybern. 42(2), 513–529 (2012)
Data Analysis in Cloud
Wind Speed and Direction Predictions Based
on Multidimensional Support Vector
Regression with Data-Dependent Kernel
Abstract. The development of wind power places higher requirements on the accurate prediction of wind. In this paper, a trustworthy and practical approach, Multidimensional Support Vector Regression (MSVR) with a Data-Dependent Kernel (DDK), is proposed. In the prediction model, the longitudinal and lateral components of the wind speed, converted from the original wind speed and direction, are used as the input, and the data-dependent kernel is used instead of classic kernels. The model is tested on actual wind data from NCEP/NCAR. The MSVR with DDK model achieves higher accuracy than MSVR without DDK, single SVR, and neural networks.
1 Introduction
As everyone knows, almost every walk of life is closely related to the wind, so accurate prediction of wind speed and direction is important and necessary. However, wind as a meteorological factor is an intermittent and non-dispatchable energy source, which makes this research difficult.
In the literature, many different techniques for predicting wind speed have been studied. The first is the physical method, which is used worldwide for large-scale weather [1]; it does not apply to short-term prediction because it needs a long time for correction. The second is the statistical method, including the persistence method [2], time series [3], Kalman filtering [4] and grey forecasting [5]. Unlike the physical method, it exploits the correlations in historical data to predict wind speed regardless of the physical behavior of the wind. The third is the learning method, such as neural networks [6] and support vector regression [7]. The essence of the learning method is to extract the relationship between the input and output using artificial intelligence instead of describing it in the form of an analytical expression.
In the above methods, most studies concentrate on wind speed prediction. However, direction is also an important factor influencing wind power, and forecasting the wind speed and direction simultaneously is more reasonable than forecasting them separately.
2 The Method
\min \sum_{j=1}^{n} \|w_j\|^2 + C \sum_{i=1}^{l} \xi_i^2
\text{s.t.}\quad \|y_i - W\varphi(x_i) - B\| \le \varepsilon + \xi_i,\; \xi_i \ge 0    (1)
For the SVR, Eq. (1) could be solved by formulating it in the dual space, but that method cannot be used here directly since the inequality constraint is not affine. Therefore, a loss function defined on the hypersphere is introduced:
L_\varepsilon(z) = \begin{cases} 0, & z \le \varepsilon \\ (z - \varepsilon)^2, & z > \varepsilon \end{cases}    (2)
With this loss function introduced, the programming problem can be solved directly in the original space, and the training of the MSVR can be equivalently rewritten as the unconstrained optimization problem
\min \sum_{j=1}^{n} \|w_j\|^2 + C \sum_{i=1}^{l} L_\varepsilon\big(\|y_i - W\varphi(x_i) - B\|\big)    (3)
Defining
\beta_{i,j} = \frac{2}{C}\,\frac{\partial}{\partial w_j} L_\varepsilon\big(\|y_i - W\varphi(x_i) - B\|\big)    (4)
then
w_j = \sum_{i=1}^{l} \varphi(x_i)\,\beta_{i,j}    (5)
\min\; \lambda \sum_{j=1}^{n} \Big(\sum_{i,p=1}^{l} \beta_{i,j}\beta_{p,j}\, k(x_i, x_p)\Big) + \sum_{i=1}^{l} L_\varepsilon\Big(\big\|y_i - \sum_{p=1}^{l} \beta_p\, k(x_i, x_p) - B\big\|\Big)    (6)
\tilde{K}(x, x') = c(x)\,c(x')\,K(x, x')    (7)
If we choose the RBF kernel as the original kernel function, the scale factor can be written as
\tilde{g}(x) = \sqrt{\frac{h_i^n}{\sigma^n}}\; e^{-n r^2/2\tau_i^2} \sqrt{1 + \frac{r^2}{4\sigma^4}}    (9)
where r = \|x - x_i\|. From Eq. (9), the value of \tilde{g}(x) is smaller around x_i when h_i < \sigma and \tau < \sigma/\sqrt{n}, so \tilde{g}(x) can be controlled by changing h_i. That is to say, we can compress the space around the support vectors by regulating h_i to improve the accuracy of the regression.
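Equation (8) for the scale factor c(x) is not legible in this copy; the sketch below therefore assumes the common conformal form c(x) = sum_i h_i exp(-||x - x_i||^2 / 2 tau^2) over the support vectors, applied to an RBF base kernel as in Eq. (7).

import numpy as np

def rbf_kernel(X1, X2, sigma):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def scale_factor(X, sv, h, tau):
    # Assumed form of c(x): Gaussians centred on the support vectors sv,
    # with heights h and width tau (Eq. (8) is not legible in this copy).
    d2 = ((X[:, None, :] - sv[None, :, :]) ** 2).sum(-1)
    return (h[None, :] * np.exp(-d2 / (2 * tau ** 2))).sum(axis=1)

def data_dependent_kernel(X1, X2, sv, h, sigma, tau):
    # Eq. (7): K~(x, x') = c(x) c(x') K(x, x').
    c1, c2 = scale_factor(X1, sv, h, tau), scale_factor(X2, sv, h, tau)
    return c1[:, None] * c2[None, :] * rbf_kernel(X1, X2, sigma)

X = np.random.default_rng(0).normal(size=(6, 3))
sv, h = X[:2], np.array([0.5, 0.8])          # illustrative support vectors
print(data_dependent_kernel(X, X, sv, h, sigma=1.0, tau=0.5).shape)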
3 Model
In this paper, a model for predicting the wind speed and direction is proposed, as shown in Fig. 1. In the feature extraction step, we choose wind speed, wind direction, air temperature, pressure, potential temperature, precipitable water, relative humidity and sea level pressure as the features of the data set. The regression function has the form f(x) = W · φ(x) + B.
where S = \sum_{i=1}^{l} \sin\theta_i, C = \sum_{i=1}^{l} \cos\theta_i, and θ is the angle between the wind vector and the north vector.
Secondly, the wind direction vector is divided into a lateral component and a longitudinal component; the corresponding formulas are sketched in code below. Dividing the wind speed into two components in this way links wind speed and wind direction together more closely, and the relationship can appear in the process of multi-output SVR.
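Since Eqs. (10) and (11) are not reproduced above, the following sketch shows one common way to split a (speed, direction) pair into longitudinal and lateral components relative to a reference direction; the use of the circular mean direction as the reference axis is an assumption for illustration.

```python
import numpy as np

def mean_direction(theta):
    # Circular mean of wind directions (radians), built from S = sum(sin) and C = sum(cos)
    return np.arctan2(np.sin(theta).sum(), np.cos(theta).sum())

def decompose(speed, theta, theta_ref):
    # Longitudinal component: projection onto the reference axis
    # Lateral component: projection onto the perpendicular axis
    longitudinal = speed * np.cos(theta - theta_ref)
    lateral = speed * np.sin(theta - theta_ref)
    return longitudinal, lateral

# Speed and direction can be recovered from the two components:
# speed = hypot(longitudinal, lateral); theta = theta_ref + arctan2(lateral, longitudinal)
```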
$$L(\beta_{aug}) = \lambda\,\beta_{aug}^{T} K_{aug}\,\beta_{aug} + \sum_{i=1}^{l} L_\epsilon\big(\|y_i - K_i^{T}\beta_{aug}\|\big) \qquad (12)$$
Using the Newton–Raphson method to minimize Eq. (12), let $G = \frac{\partial L(\beta_{aug})}{\partial \beta_{aug}}$ and $H = \frac{\partial^2 L(\beta_{aug})}{\partial \beta_{aug}^2}$. We can reach the convergent result through the iterative formula:
$$\beta_{aug}^{new} = \beta_{aug}^{old} - \alpha H^{-1} G \qquad (13)$$
$$L(\beta_{aug}) = \lambda\,\beta_{aug}^{T}\tilde{K}_{aug}\,\beta_{aug} + \sum_{i \in M_{sv}} \big(\|y_i - \tilde{K}_i^{T}\beta_{aug}\| - \epsilon\big)^2 \qquad (14)$$
where $M_{sv}$ is the set of support vectors, i.e. the samples for which $\|y_i - K_i^{T}\beta_{aug}\| > \epsilon$. Finally, the convergent result $\hat{\beta}_{aug}$ can be obtained, and the needed variables W and B can be calculated. Hence the regression model is obtained.
The process of MSVR with the Data-Dependent kernel is divided into two parts, one operating on the training sample set and one on the test sample set.
The training and prediction process is described as follows:
Training Input: data set X;
Step 1: Choosing the classic kernel K as the original kernel and using X to learn a regression model with Eq. (12), we get Msv and βaug;
Step 2: Calculating c(x) with Eq. (8) utilizing the information of Msv, we get the Data-Dependent kernel K̃;
Step 3: Relearning a regression model with Eq. (14), Msv and βaug are updated;
Step 4: If βaug has not converged, repeat Step 2 to Step 3; otherwise calculate the variables W and B;
Training Output: regression model.
Prediction Input: test samples x;
Step 1: Feeding x into the regression model with the last Msv, we get the predicted value;
Prediction Output: the predicted value y corresponding to x.
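As a rough illustration of the training loop described above, the sketch below fits a kernel regression model in the β parameterization. It uses a plain regularized least-squares solve in place of the full ε-insensitive Newton–Raphson update of Eqs. (12)–(14), and an assumed Gaussian conformal factor around the support samples, so it should be read as a simplified stand-in rather than the authors' implementation.

```python
import numpy as np

def rbf(X1, X2, sigma=1.0):
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_beta(K, Y, lam=1e-3):
    # Simplified stand-in for minimizing Eq. (12): a ridge-style solve for beta
    # (the eps-insensitive loss and the Newton-Raphson reweighting are omitted).
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), Y)

def conformal(X, centers, tau=1.0):
    # Assumed Gaussian conformal factor built around the current support samples
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return 1.0 + np.exp(-d2 / (2.0 * tau ** 2)).sum(axis=1)

def train(X, Y, eps=0.1, rounds=3):
    K = rbf(X, X)                        # Step 1: classic kernel
    beta = fit_beta(K, Y)
    for _ in range(rounds):
        resid = np.linalg.norm(Y - K @ beta, axis=1)
        sv = X[resid > eps]              # support samples: residual above eps
        c = conformal(X, sv)
        K = c[:, None] * c[None, :] * rbf(X, X)   # data-dependent kernel, Eq. (7)
        beta = fit_beta(K, Y)            # Step 3: relearn with the new kernel
    return beta
```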
4 Experiment
Fig. 2. Wind speed and direction during January 1 – December 31
We can see that the difference in magnitude between wind speed and direction is large. Hence, Eqs. (10) and (11) can convert them into two wind speed components. By the calculation, the reference angle is 210.6213° clockwise from the north axis. So the data set can be reconstructed into the form shown in Fig. 3.
After the data preprocessing, the training process of MSVR with the Data-Dependent kernel can begin. The first 1100 data points are used as training data, and the next 100 data points are used as testing data. We proceed following the method described above.
In the model, the parameter C is set to [1.00e+003, 1.00e+005]. The RBF kernel is chosen as the original kernel and the kernel parameter is set to [7500, 1000]. Besides, the iteration step α and the iterative threshold value in the Newton–Raphson method are set to 1 and 0.001, respectively. The prediction result is shown in Fig. 4.
A reasonable error analysis is essential for evaluating the results. This paper selects the root-mean-square error (RMSE) and the mean absolute percentage error (MAPE) as the evaluation indicators:
" #
1 X M
jyi f ðxi Þj
MAPE ¼ 100%
M i¼1 yi
vffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi ð15Þ
u M
u1 X
RMSE ¼ t ½yi f ðxi Þ2
M i¼1
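For reference, the two indicators of Eq. (15) can be computed directly; the sketch below assumes y and y_pred are NumPy arrays of the true and predicted values.

```python
import numpy as np

def mape(y, y_pred):
    # Mean absolute percentage error, Eq. (15)
    return float(np.mean(np.abs(y - y_pred) / y) * 100.0)

def rmse(y, y_pred):
    # Root-mean-square error, Eq. (15)
    return float(np.sqrt(np.mean((y - y_pred) ** 2)))
```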
Table 1 shows the errors of the predicted longitudinal and lateral wind speed components obtained with MSVR with the Data-Dependent kernel.
Table 1. Errors of two wind speed components
Component               MAPE (%)   RMSE (m/s)
Longitudinal component  10.42      0.0390
Lateral component       10.16      0.0301
At present, the prediction error of wind speed time series can be close to 20 %. So the errors of 10.42 % and 10.16 % are rather small, and the MSVR with Data-Dependent kernel model performs well. However, this might be a coincidence. Hence, we run several other methods and compare their performance to confirm the ability of our method in predicting the wind speed and direction. Here, we choose three other methods: MSVR without the Data-Dependent kernel, single SVR, and a neural network. These methods are all learning methods. Comparison with MSVR without the Data-Dependent kernel shows the effect of the data-dependent kernel. Comparison with single SVR supports the idea that forecasting both quantities simultaneously is beneficial. Neural networks are commonly compared with SVR, so we also include one for contrast. Figure 5 shows their performance and Table 2 lists their prediction errors.
Fig. 5. Real data versus predicted wind speed (m/s) and wind direction (°) over the 100 test points: (a) MSVR with DDK and MSVR without DDK; (b) MSVR with DDK and single SVR; (c) MSVR with DDK and NN.
From the error values in Table 2, we can see clearly that the MSVR with Data-Dependent kernel model has a relatively small prediction error. The contrast experiment confirms our model's ability to predict the wind speed and direction.
5 Conclusion
In this paper, the MSVR with Data-Dependent kernel model for forecasting wind speed and direction at the same time is built. The next six-hour wind in the farm is predicted by four models, including ours. The error analysis shows that the MSVR with Data-Dependent kernel model is the best model for wind speed and direction forecasting, with the lowest MAPE (16.81 %, 18.59 %) and RMSE (0.2246, 11.4595), 4–5 points lower than the second-best, MSVR without the Data-Dependent kernel. From the data, we can see the improvement brought by the data-dependent kernel, and it confirms that forecasting the two factors at the same time gives better results than forecasting them separately. Thus, it is an effective model for wind speed and direction forecasting.
Acknowledgments. This work was supported in full by the Natural Science Foundation of Jiangsu Province under grant No. BK2012858, and in part by the National Natural Science Foundation of China under grant number 61103141.
References
1. Negnevitsky, M., Potter, C.W.: Innovative short-term wind generation prediction techniques. In: Power Engineering Society General Meeting, pp. 60–65. IEEE Press, Montreal (2006)
2. Han, S., Yang, Y.P., Liu, Y.Q.: Application study of three methods in wind speed
prediction. J. North China Electric Power Univ. 35(3), 57–61 (2008)
3. Firat, U., Engin, S.N., Saralcar, M., Ertuzun, A.B.: Wind speed forecasting based on second
order blind identification and autoregressive model. In: International Conference on Machine
Learning and Applications (ICMLA), pp. 686-691 (2010)
4. Babazadeh, H., Gao, W.Z., Lin, C., Jin, L.: An hour ahead wind speed prediction by Kalman
filter. In: Power Electronics and Machines in Wind Applications (PEMWA), pp. 1-6 (2012)
5. Huang, C.Y., Liu, Y.W., Tzeng, W.C., Wang, P.Y.: Short term wind speed predictions by using the grey prediction model based forecast method. In: Green Technologies Conference (IEEE-Green), pp. 1–5 (2011)
6. Ghanbarzadeh, A., Noghrehabadi, A.R., Behrang, M.A., Assareh, E.: Wind speed prediction
based on simple meteorological data using artificial neural network. In: IEEE International
Conference on Industrial Informatics, pp. 664-667 (2009)
7. Peng, H.W., Yang, X.F., Liu, F.R.: Short-term wind speed forecasting of wind farm based
on SVM method. Power Syst. Clean Energy 07, 48–52 (2009)
8. Wang, D.C., Ni, Y.J., Chen, B.J., Cao, Z.L.: A wind speed forecasting model based on
support vector regression with data dependent kernel. J. Nanjing Normal Univ. (Nat. Sci.
Edn.) 37(3), 15–20 (2014)
9. Gu, B., Sheng, V.S.: Feasibility and finite convergence analysis for accurate on-line ν-support vector machine. IEEE Trans. Neural Netw. Learn. Syst. 24(8), 1304–1315 (2013)
10. Cortes, C., Vapnik, V.: Support vector networks. Mach. Learn. 20, 273–297 (1995)
11. Wu, I., Amari, S.: Conformal transformation of kernel functions: a data-dependent way to
improve support vector machine classifiers. Neural Process. Lett. 15(1), 59–67 (2002)
Research on Rootkit Detection Model Based
on Intelligent Optimization Algorithm
in the Virtualization Environment
Abstract. In order to address the high misjudgment ratio of Rootkit detection and the inability to detect unknown Rootkits in a virtualized guest operating system, a Rootkit detection model (QNDRM) based on an intelligent optimization algorithm is proposed. The detection model combines a neural network with QPSO, taking advantage of both. In the actual detection, QNDRM first captures the previously selected typical characteristic behaviors of Rootkits. Then, the trained system detects the presence of a Rootkit. The experimental results show that QNDRM can effectively reduce the misjudgment ratio and detect both known and unknown Rootkits.
1 Introduction
2 Related Work
Attackers constantly update their attack technology and develop new Rootkits to attack virtual machines in order to achieve their purposes. As a result, it is very important to improve the ability to detect unknown Rootkits. Artificial intelligence methods from systems engineering can address this problem well. In recent years, a variety of artificial intelligence methods have been applied to detect Rootkits.
Lu [3] studies a technique that uses an artificial immune algorithm to detect malicious programs. The technology based on artificial immune optimization is applied to detect malicious programs on computers and mobile phones, and it improves the detection rate. Zhang [4] applies the artificial immune algorithm to IRP-based malicious code detection in order to improve the efficiency and accuracy of malicious code detection. Pan [5] studies the removal of malicious code based on an expert system. This technology can accurately detect malicious behavior recorded in the knowledge library, but the detection rate for unknown malicious code is low, and the ability to acquire knowledge is poor. Shirazi [6] applies a genetic algorithm to a malicious code detection system and reduces the misjudgment rate. Abadeh [7] applies the ant colony algorithm to optimize a fuzzy classification algorithm in intrusion detection, which effectively improves the classification precision and the accuracy of intrusion detection. Dastanpour [8] detects malicious code by prediction based on combining a genetic algorithm and a neural network. The method can effectively detect the samples in the CPU data.
3 Architecture
The QNDRM detection model feeds the Rootkit behavior features, encoded through the quantitative module, into a BP network. These encoded behavior features are stored in the BP network through training and serve as the expert system knowledge library. Finally, QNDRM accomplishes the goal of detecting unknown Rootkits.
Cybenko proved in 1988 that a single hidden layer can solve any classification problem when each neuron uses an S-shaped function [9]. A BP neural network with only three layers can realize an arbitrary n- to m-dimensional vector map, provided that it uses a tangent sigmoid transfer function between connected layers and the hidden layer has enough neurons [10]. Therefore, this paper determines that QNDRM's transfer function is the tangent sigmoid function and the network structure has three layers. The QNDRM detection model structure based on the BP neural network is shown in Fig. 1.
The QNDRM detection model includes a quantitative module, a subject learning inference engine (automatic knowledge acquisition, knowledge library, and inference engine), and a decoding module. Firstly, the QNDRM detection model codes the behavioral characteristics through the quantitative module. Then, the model implicitly expresses the behavioral characteristics as the weight values and threshold values of the network structure and thus indirectly expresses expert knowledge. Finally, the model stores this knowledge in the network structure as the expert knowledge library. QNDRM realizes
the reasoning mechanism based on the structure of the neural network, outputs non-numerical information through the decoding module, and completes the Rootkit detection process.
$$u_i = \langle R_i, N_i \rangle \qquad (2)$$
$$v_i = \frac{\mathrm{Con}(u_i)}{\sum \mathrm{Con}(u_i)} \qquad (5)$$
However, there is a certain error between the actual output value and the ideal output value of QNDRM; that is, the actual output value is not exactly “0” or “1”. Therefore, we need to define an error range. When the actual output is within the error range, we regard the output as “0” or “1”.
the complementary advantages between them, avoiding slow convergence speed, poor optimization precision, and the local minimum problem. This can greatly improve the training effect.
(1) Improved QPSO Algorithm.
The inertia weight is constant in the standard QPSO algorithm, which cannot effectively reflect actual applications. To address this, a random inertia weight b is put forward; b changes with the evolution generation and a random factor. In the algorithm, b is updated by the following formula:
$$b = b_{max} - \frac{\mathrm{rand}()\cdot G_{current}}{2\,G_{max}}, \qquad 0 \le \mathrm{rand}() \le 1 \qquad (6)$$
Here, rand() is a random number between 0 and 1, $b_{max}$ is the maximum inertia weight ($b_{max} = 0.9$), $G_{current}$ is the current generation, and $G_{max}$ is the maximum generation.
If rand() were treated as a constant, b would decrease linearly as $G_{current}$ increases. At the beginning, $G_{current}$ is small, $G_{current}/G_{max}$ is small, and b is large, so the algorithm has strong global search ability, obtains suitable particles, and is more likely to jump out of local minimum points early on. As $G_{current}$ increases, $G_{current}/G_{max}$ becomes larger and b becomes smaller. This increases the local search ability of the algorithm so that the particles gradually shrink toward promising areas for a more detailed search, and at the same time it improves the convergence speed. In fact, rand() changes randomly: the overall trend is linearly decreasing, but b changes nonlinearly. This produces a complex change process and avoids premature convergence.
(2) Steps of Inference Engine.
The specific steps of the inference engine are as follows:
Step 1: The BP network structure is determined and the training sample set is given.
Step 2: The population size, the particle dimension, and the number of iterations of the improved QPSO are set based on the BP network structure. Each particle represents a set of neural network parameters.
Step 3: Initialize the population of particles, and set the mean square error of the BP network as the particle fitness function.
Step 4: Decode each particle to obtain the corresponding weight values and threshold values of the BP network.
Step 5: The samples are input into QNDRM to obtain the corresponding output, and then the fitness value of each particle is calculated through the mean square error function.
Step 6: Compare the mean square error E with the expected value e. If E < e, jump to Step 9; otherwise jump to Step 7.
Step 7: Update the best position of each particle and the global best position according to the algorithm.
Step 8: Increase the number of iterations k; if k > k_max, execute the next step, otherwise jump to Step 4.
Step 9: Map the global optimal particle to the structure parameters of the BP network; these parameters are the final result of the optimization.
Step 10: End of the algorithm (Fig. 2).
Fig. 2. Flowchart of the inference engine based on the improved QPSO.
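To make the loop above concrete, here is a minimal QPSO sketch using the random inertia weight of Eq. (6); the quantum position update (mean-best attractor with a logarithmic jump) follows the standard QPSO formulation, and the toy fitness function stands in for the BP network's mean square error, which is an assumption for illustration.

```python
import numpy as np

def qpso(fitness, dim, pop=20, g_max=100, b_max=0.9):
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, (pop, dim))           # particle positions (network parameters)
    pbest = X.copy()
    pfit = np.array([fitness(x) for x in X])
    gbest = pbest[pfit.argmin()].copy()
    for g in range(1, g_max + 1):
        # Random inertia weight, Eq. (6): b = b_max - rand() * G_current / (2 * G_max)
        b = b_max - rng.random() * g / (2.0 * g_max)
        mbest = pbest.mean(axis=0)               # mean of personal best positions
        for i in range(pop):
            phi = rng.random(dim)
            p = phi * pbest[i] + (1 - phi) * gbest            # local attractor
            u = rng.random(dim)
            jump = b * np.abs(mbest - X[i]) * np.log(1.0 / u)
            X[i] = p + np.where(rng.random(dim) < 0.5, jump, -jump)
            f = fitness(X[i])
            if f < pfit[i]:
                pfit[i], pbest[i] = f, X[i].copy()
        gbest = pbest[pfit.argmin()].copy()
    return gbest

# Toy fitness standing in for the BP network's mean square error
best = qpso(lambda w: float(np.sum(w ** 2)), dim=10)
```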
A Rootkit must be loaded into memory and leaves various behavioral traces while performing its functions. It is therefore feasible to find a Rootkit by analyzing these behavioral traces in the system. In the actual detection process, behavior characteristics of the Rootkit must be chosen as the input of QNDRM. To keep the amount of calculation small, the response speed fast, and the detection rate high, this paper selects representative and effective Rootkit behavior characteristics based on the KDD CUP99 dataset [11] as the input of QNDRM. The KDD CUP99 dataset is the most authoritative testing data set, and it contains about 5 million records (Table 2).
There is no specific formula for determining the number of hidden layer neurons, so we can only estimate it from the characteristics of the network structure and the work of others. The best value is then selected from the estimated range through experiments, which improves the efficiency of selection. In a three-layer network, two empirical formulas are used most frequently, as follows [12]:
$$N_{hidden} = \sqrt{N_{in}\cdot N_{out}} \qquad (7)$$
The paper has identified that $N_{in}$ is 18 and $N_{out}$ is 1. So, according to the above formulas,
$$N_{hidden} = \sqrt{N_{in}\cdot N_{out}} \approx 4 \qquad (9)$$
According to the results of the above formulas, the number of neurons in the hidden layer is between 4 and 10. Without considering changes of the other parameters, hidden layer sizes within the range 4 to 10 were tested, as shown in Table 3.
Table 3. Mean square error for different numbers of hidden layer neurons
Hidden neurons   Mean square error
4                0.00426281
5                0.00120495
6                0.00064653
7                0.00026821
8                0.00024276
9                0.00023672
10               0.00024012
Before the node number reaches 6, the mean square error and the error percentage drop obviously. After the node number reaches 8, the trend slows down. Considering that more hidden layer neurons would increase the computational complexity, this paper sets the number of hidden layer neurons to 7. Detailed parameters are shown in Table 4.
is shown in Fig. 3 as a smooth curve. The Rootkit is implanted at 19 s, and the output value changes to 1. So QNDRM can effectively detect known and unknown Rootkits.
Fig. 3. Detection output of QNDRM over time (the Rootkit is implanted at 19 s).
Fig. 4. Relative performance of each detection system (ratio to running without any detection system) on the benchmark test items.
As shown in Fig. 4, the Y axis represents the ratio between the performance when running each detection system and the performance without any detection system, and the X axis represents the test items. The method in this paper is better than the other methods in terms of performance loss. When QNDRM is running, the average performance loss is 5.7 %.
5 Conclusion
In this paper we have presented a novel approach to detecting Rootkits. To the best of our knowledge, no papers on this kind of study have been published. A Rootkit heuristic detection model based on an intelligent optimization algorithm in a virtualization environment was proposed. The model improves the ability to detect known and unknown Rootkits and enhances the security of the virtual machine by combining a neural network and QPSO. Finally, the experiments showed that the detection model can effectively find Rootkits. In a word, the method solves the problem in the virtualization environment by using an intelligent optimization algorithm, which provides a new way of detecting Rootkits.
References
1. Vivek, K.: Guide to Cloud Computing for Business and Technology Managers: From
Distributed Computing to Cloudware Applications. CRC Press, Boca Raton (2014)
2. Hoglund, G., Butler, J.: Rootkits: Subverting the Windows Kernel. Addison-Wesley
Professional, Reading (2006)
3. Lu, T.: Research on Malcode Detection Technology Based on Artificial Immune System.
Beijing University of Posts and Telecommunications (2013)
4. Zhang F.: Research on Artificial Immune Algorithms on Malware Detection. South China
University of Technology (2012)
5. Pan, J.: Design and Implementation of Host-Based Malcode Detection System. University of Science and Technology of China, Anhui (2009)
6. Shirazi, H.M.: An intelligent intrusion detection system using genetic algorithms and
features selection. Majlesi J. Electr. Eng. 4(1), 33–43 (2010)
7. Abadeh, M.S., Habibi, J.A.: Hybridization of evolutionary fuzzy systems and ant colony
optimization for intrusion detection. ISC Int. J. Inf. Secur. 2(1), 33–46 (2015)
8. Dastanpour, A., Ibrahim, S., Mashinchi, R.: Using genetic algorithm to supporting artificial
neural network for intrusion detection system. In: The International Conference on
Computer Security and Digital Investigation (ComSec2014). The Society of Digital
Information and Wireless Communication, pp. 1–13 (2014)
9. Yuan, X., Li, H., Liu, S.: Neural Network and Genetic Algorithm Apply in Water Science.
China Water & Power Press, Beijing (2002)
10. Zhu H.: Intrusion Detection System Research Based on Neural Network. Shandong
University (2008)
11. Wan, T., Ma, J., Zeng, G.: Analysis of sample database for intelligence intrusion detection
evaluation. South-Central Univ. Nationalities 2(29), 84–87 (2010)
12. Debar, H., Becker, M., Siboni, D.: A neural network component for an intrusion detection
system. In: Proceedings of the 1992 IEEE Computer Society Symposium on Research in
Security and Privacy, pp. 240–250. IEEE (1992)
Performance Analysis of (1+1)EA
on the Maximum Independent Set Problem
Xue Peng(B)
1 Introduction
Afterwards, the running time analysis of EAs turned to some classic combinatorial optimization problems in the class P, such as maximum matching [5], the minimum spanning tree [6], the shortest path problem [7], and so on. These theoretical studies show that though EAs cannot beat the classic problem-specific algorithms in the general case, they can solve these problems within an expected running time that is close to that of the specific algorithms.
In practice, many combinatorial optimization problems are NP-complete. Complexity theory tells us that unless P = NP, the maximum independent set problem cannot be solved by a deterministic polynomial-time algorithm. A natural question is whether we can effectively find approximate solutions. Although EAs are a class of global optimization algorithms, we cannot expect them to solve every instance of NP-complete problems in polynomial time. Therefore, research on the approximation performance of EAs is meaningful work. Witt [8] presented the approximation performance of EAs on an NP-complete problem, the Partition problem. He proved that both the random local search algorithm and the (1+1)EA can obtain an approximation ratio of 4/3 in an expected running time of $O(n^2)$. Friedrich et al. [9] analyzed the approximation performance of a hybrid evolutionary algorithm on the vertex cover problem. They investigated some special instances to show that EAs can improve the approximation solution.
Recently, Yu et al. [10] proposed an evolutionary algorithm called SEIP (simple evolutionary algorithm with isolated population) and proved that for the unbounded set cover problem this algorithm can achieve an approximation ratio of $H_n$, where $H_n = \sum_{i=1}^{n}\frac{1}{i}$ is the nth harmonic number. For the k-set cover problem, they showed that SEIP can obtain an approximation ratio of $H_k - \frac{k-1}{8k^9}$. Later on, Zhou et al. [11–14] extended the approximation performance analysis of EAs to other combinatorial optimization problems.
In this paper, we analyze a classic combinatorial optimization problem, the maximum independent set problem (MISP). Given an undirected graph, an independent set is a set of vertices such that no two vertices in this set are adjacent. The goal is to find an independent set whose size is maximized. In this paper, we investigate the approximation performance of the (1+1)EA on MISP from a theoretical point of view. We prove that by simulating the local search algorithm, the (1+1)EA can obtain the same approximation ratio $\frac{\Delta+1}{2}$ as the local search algorithm in expected running time $O(n^4)$, where $\Delta$ denotes the maximum degree of the graph. Further, by constructing an instance of MISP, we show that the (1+1)EA outperforms the local search algorithm on it.
The remainder of this paper is structured as follows. In Sect. 2, we introduce
the algorithms, problem and method; Sect. 3 presents the approximation per-
formance of the (1 + 1)EA on MISP; Sect. 4 analyzes the performance of the
(1+1)EA on an instance of MISP; Sect. 5 concludes the paper.
2.2 (1+1)EA
The (1+1)EA, which uses a mutation operator and a selection operator, is a simple and effective evolutionary algorithm with a population size of 1. The description of the (1+1)EA is given as follows:
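Since the algorithm listing did not survive extraction, the following is a minimal sketch of the standard (1+1)EA for bit strings: flip each bit independently with probability 1/n, and accept the offspring if its fitness is not worse.

```python
import random

def one_plus_one_ea(fitness, n, max_iters=10_000):
    x = [random.randint(0, 1) for _ in range(n)]       # random initial bit string
    fx = fitness(x)
    for _ in range(max_iters):
        # Standard bit-flip mutation: each bit is flipped with probability 1/n
        y = [b ^ 1 if random.random() < 1.0 / n else b for b in x]
        fy = fitness(y)
        if fy >= fx:                                    # selection: accept if not worse
            x, fx = y, fy
    return x, fx
```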
where edge eij connects vertices vi and vj , that is, eij = (vi , vj ).
Now we introduce the concepts of independent point and independent edge, which are defined relative to the subset A. An independent point is a point to which no edge connects, or which has an edge connecting to it but the point at the other endpoint of this edge is in A. An independent edge is an edge whose two endpoints are not in A. The term $\sum_{i=1}^{n} x_i \sum_{j=1}^{n} x_j e_{ij}$ in formula (1) is used to compute the number of non-independent edges contained in the current solution. Note that $\sum_{i=1}^{n} x_i \sum_{j=1}^{n} x_j e_{ij} \neq 0$ indicates that the current solution is not an independent set, while $\sum_{i=1}^{n} x_i \sum_{j=1}^{n} x_j e_{ij} = 0$ indicates that the current solution is an independent set. Our goal is to maximize the fitness function $f_1(x)$. In the case of a non-independent set, the algorithm accepts two operations: the first is to add independent points, and the second is to remove non-independent points and thus reduce non-independent edges. In the case of an independent set, the algorithm only accepts the operation of adding independent points. The second term of the fitness function gives a penalty to each non-independent point, and we call it a penalty term.
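Formula (1) itself is not reproduced above, so the following sketch implements a fitness of the kind described: reward the selected vertices and penalize every edge whose two endpoints are both selected. The penalty coefficient is an assumption for illustration.

```python
def misp_fitness(x, edges, penalty=None):
    """x: 0/1 list over vertices; edges: list of (i, j) vertex pairs."""
    n = len(x)
    if penalty is None:
        penalty = n + 1                       # assumed: large enough to dominate vertex gains
    selected = sum(x)
    conflicts = sum(1 for i, j in edges if x[i] and x[j])   # non-independent edges
    return selected - penalty * conflicts
```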
The MISP is an NP-complete problem. For this class of problems, saying that an algorithm obtains an approximation ratio α means that for any instance, the algorithm generates a solution whose value a satisfies a ≥ o/α, where a and o denote the value of the current solution and the value of the optimal solution, respectively. Such an algorithm is called an α-approximation algorithm.
For MISP, there are two kinds of algorithms: exact algorithms and approximation algorithms. Currently, the best exact algorithm can obtain the MISP of a graph containing n vertices in expected running time $O(1.2^n)$ [17]. Approximation algorithms can only obtain a constant approximation ratio on some special graphs. Using a greedy algorithm, Halldórsson et al. [18] obtained an approximation ratio of $\frac{\Delta+2}{3}$ on graphs whose vertex degree is bounded by a constant $\Delta$. Khanna et al. [16] presented an approximation performance analysis of a local search algorithm with 3-flip neighborhood on MISP. In this paper, we show that by simulating the local search algorithm, the (1+1)EA can obtain the same approximation ratio as the local search algorithm.
Lemma 2 [16]. The local search algorithm with 3-flip neighborhood can obtain an approximation ratio of $\frac{\Delta+1}{2}$ on any instance of MISP whose vertex degree is bounded by a constant $\Delta$.
In the following, we introduce the 3-flip operation and the local search algorithm with 3-flip neighborhood.
Suppose that S is an independent set. The so-called 3-flip operation is an operation of adding a vertex to S, or deleting a vertex from S and simultaneously adding two vertices that are not in S.
At each step, the local search algorithm improves the current solution using its neighbors. Suppose that S′ is a new independent set obtained by executing a 3-flip operation on S. The framework of the local search algorithm with 3-flip neighborhood is as follows: starting from an initial solution S₀, the algorithm finds a new solution S′ in its neighborhood by using the 3-flip operation. This process is repeated until the stopping condition of the algorithm is met.
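As an illustration only, here is a minimal sketch of local search with the 3-flip neighborhood described above: keep the solution independent, and repeatedly either add one vertex, or swap one vertex out for two non-adjacent vertices that can both be added.

```python
def three_flip_local_search(n, adj):
    """n: number of vertices; adj: dict vertex -> set of neighboring vertices."""
    S = set()

    def can_add(v, sol):
        return v not in sol and adj[v].isdisjoint(sol)

    improved = True
    while improved:
        improved = False
        # Move 1: add a single vertex
        for v in range(n):
            if can_add(v, S):
                S.add(v)
                improved = True
                break
        if improved:
            continue
        # Move 2: remove one vertex, then add two non-adjacent vertices
        for v in list(S):
            rest = S - {v}
            cands = [u for u in range(n) if can_add(u, rest)]
            for a in cands:
                for b in cands:
                    if a < b and b not in adj[a]:
                        S = rest | {a, b}
                        improved = True
                        break
                if improved:
                    break
            if improved:
                break
    return S
```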
By simulating the local search algorithm with 3-flip neighborhood, the (1+1)EA can obtain the same approximation ratio $\frac{\Delta+1}{2}$.
Theorem 1. For any given instance I of MISP whose vertex degree is bounded by a constant $\Delta$, let opt be the maximal number of vertices of an independent set of I. The (1+1)EA finds an independent set with at least $\frac{2\,opt}{\Delta+1}$ vertices in expected running time $O(n^4)$.
Proof. We use the method of fitness partitioning to prove this theorem. We partition the solution space into three disjoint sets $A_1$, $A_2$, $A_3$ according to different fitness function values.
Using the local search algorithm with the 3-flip operation, Khanna et al. [16] have proven that this algorithm can obtain an approximation ratio of $\frac{\Delta+1}{2}$, and they also pointed out that this approximation ratio is tight.
In the following, we construct an instance $I_1$ and show that the local search algorithm with 3-flip neighborhood can be trapped in a local optimum on this instance, while the (1+1)EA can obtain the global optimal solution in polynomial time. The vertices of this instance are divided into two sets X and Y, where X contains d elements and Y contains $\frac{d(d-1)}{2}$ elements, and the set Y consists of subsets which contain one or two elements of set X. An example of instance $I_1$ is given in Fig. 1.
Obviously, the maximum independent set (global optimum) of $I_1$ is the solution that contains all of the elements in Y but does not contain any element of X. A solution containing all elements of X but no element of Y is called a locally optimal solution with respect to the local search algorithm with 3-flip neighborhood: no single 3-flip operation can improve it, so the local search algorithm with 3-flip neighborhood is trapped in this local optimum.
Theorem 2. Starting from any initialized solution, the (1 + 1)EA can find the
global optimal solution of instance I1 in expected running time O(n5 ).
Proof. Note that the elements of X from left to right are {1}, {2}, . . ., {d}, and the elements of Y from left to right are {1}, {2}, . . ., {d}, {1, 2}, {1, 3}, . . ., {d − 1, d}. We also use the method of fitness partitioning to prove this theorem, and we divide the proof into two parts.
The first part: the current solution is a non-independent set. In this case, the algorithm accepts only operations that reduce the number of non-independent points. Note that the maximum number of non-independent points is n (with $n = d + d + \frac{d(d-1)}{2}$), while the algorithm only needs to remove one non-independent point in each step, and the probability of executing this operation is $\Omega(\frac{1}{n})$. Therefore, in the first part, the expected running time of the (1+1)EA on instance $I_1$ is bounded above by $O(n^2)$.
The second part: in this part the current solution is an independent set and
we can divide the solutions into four cases:
Case 1: If the current solution contains all elements of X, then it does not contain any element of Y. By deleting two elements {i} and {j} randomly from X, and meanwhile adding the elements {i}, {j} and {i, j} of Y, the fitness value is increased by 1. Since the (1+1)EA accepts solutions with better fitness, once the algorithm leaves the solution that contains all elements of X, it will never return to it. Thus, the event that the solution contains all of the elements of X may occur at most once in the evolution process. Now we compute the probability that the (1+1)EA performs this operation. The probability of choosing one element from {1, 2}, {1, 3}, . . ., {d − 1, d} is $\frac{d(d-1)}{2n}$, and if the chosen element is {i, j}, we need to delete the two elements {i} and {j} from X and meanwhile add the two elements {i} and {j} of Y. The probability of executing this operation is $\frac{d(d-1)}{2n}\cdot\frac{1}{n^4}\,(1-\frac{1}{n})^{n-5} = \Omega(\frac{1}{n^4})$.
Case 2: If the number of elements chosen from X in the current solution is less than d, and the current solution does not contain any element of Y, the algorithm can increase the fitness value by adding elements of X, or by adding elements of Y which are not incident to X. The probability of executing the above operation is $\Omega(\frac{1}{n})$.
Case 3: If the number of elements chosen from X in the current solution is less than d, and the current solution contains elements of Y, we divide this case into two subcases:
Case 3.1: The number of elements chosen from X in the current solution is less than d − 1. W.l.o.g., we assume that the current solution does not contain the elements {i} and {j} of X but contains {k}. Since the current solution is an independent set, neither element {i, k} nor {j, k} is contained in the current solution. By deleting the element {k} from X and meanwhile adding the two elements {i, k} and {j, k}, the fitness value can be increased. The probability of this operation is $\Omega(\frac{1}{n^3})$.
Case 3.2: The number of elements chosen from X in the current solution is equal to d − 1, i.e., exactly one element of X is not selected. W.l.o.g., we assume that the current solution does not contain the element {i} of X. Since the current solution is an independent set, the element {i, j} is not included in the current solution, where j ≠ i. By deleting the element {j} from X and adding the two elements {j} and {i, j}, the fitness value can be increased. The probability of this operation is $\frac{1}{n^3}$.
Case 4: If the current solution only contains elements of Y, the fitness value can be increased by adding an element of Y. The probability of this operation is $\Omega(\frac{1}{n})$.
In the second part, the maximum fitness value is $\frac{d(d-1)}{2}$ and the minimum fitness value is 0, and we also note that the fitness value is increased by at least 1 through any of the above improving operations. Following Lemma 1, we obtain that the (1+1)EA finds the optimal solution of instance $I_1$ in expected running time $O(n^5)$.
5 Conclusion
References
1. Oliveto, P.S., He, J., Yao, X.: Time complexity of evolutionary algorithms for
combinatorial optimization: a decade of results. Int. J. Autom. Comput. 4(3),
281–293 (2007)
2. Neumann, F., Witt, C.: Bioinspired Computation in Combinatorial Optimization
Algorithms and Their Computational Complexity. Springer, Berlin (2010)
3. Droste, S., Jansen, T., Wegener, I.: On the analysis of the (1+1) evolutionary
algorithm. Theor. Comput. Sci. 276, 51–81 (2001)
4. He, J., Yao, X.: Drift analysis and average time complexity of evolutionary algo-
rithms. Artif. Intell. 127(1), 57–85 (2001)
5. Giel, O., Wegener, I.: Evolutionary algorithms and the maximum matching prob-
lem. In: Alt, H., Habib, M. (eds.) STACS 2003. LNCS, vol. 2607, pp. 415–426.
Springer, Heidelberg (2003)
6. Neumann, F., Wegener, I.: Randomized local search, evolutionary algorithms and
the minimum spanning tree problem. Theor. Comput. Sci. 378(1), 32–40 (2007)
7. Doerr, B., Happ, E., Klein, C.: A tight analysis of the (1+1)-EA for the single source
shortest path problem. In: Proceedings of the IEEE Congress on Evolutionary
Computation (CEC 2007), pp. 1890–1895. IEEE Press (2007)
8. Witt, C.: Worst-case and average-case approximations by simple randomized search
heuristics. In: Diekert, V., Durand, B. (eds.) STACS 2005. LNCS, vol. 3404, pp.
44–56. Springer, Heidelberg (2005)
9. Friedrich, T., He, J., Hebbinghaus, N., Neumann, F., Witt, C.: Analyses of simple
hybrid evolutionary algorithms for the vertex cover problem. Evol. Comput. 17(1),
3–20 (2009)
10. Yu, Y., Yao, X., Zhou, Z.H.: On the approximation ability of evolutionary opti-
mization with application to minimum set cover. Artif. Intell. 180–181, 20–33
(2012)
11. Zhou, Y.R., Lai, X.S., Li, K.S.: Approximation and parameterized runtime analysis
of evolutionary algorithms for the maximum cut problem. IEEE Trans. Cyber.
(2015, in press)
12. Zhou, Y.R., Zhang, J., Wang, Y.: Performance analysis of the (1+1) evolutionary
algorithm for the multiprocessor scheduling problem. Algorithmica. (2015, in press)
13. Xia, X., Zhou, Y., Lai, X.: On the analysis of the (1+1) evolutionary algorithm for
the maximum leaf spanning tree problem. Int. J. Comput. Math. 92(10), 2023–
2035 (2015)
14. Lai, X.S., Zhou, Y.R., He, J., Zhang, J.: Performance analysis on evolutionary algo-
rithms for the minimum label spanning tree problem. IEEE Trans. Evol. Comput.
18(6), 860–872 (2014)
15. Wegener, I.: Methods for the analysis of evolutionary algorithms on pseudo-boolean
functions. In: Sarker, R., Mohammadian, M., Yao, X. (eds.) Evolutionary Opti-
mization. Kluwer Academic Publishers, Boston (2001)
16. Khanna, S., Motwani, R., Sudan, M., Vazirani, U.: On syntactic versus computational views of approximability. In: Proceedings 35th Annual IEEE Symposium on Foundations of Computer Science, pp. 819–836 (1994)
17. Ming, Y.X., Hiroshi, N.: Confining sets and avoiding bottleneck cases: a simple
maximum independent set algorithm in degree-3 graphs. Theor. Comput. Sci. 469,
92–104 (2013)
18. Halldrsson, M.M., Radhakrishnan, J.: Greed is good: approximating independent
sets in sparse and bounded-degree graphs. Algorithmica 18(1), 145–163 (1997)
A Robust Iris Segmentation Algorithm Using
Active Contours Without Edges and Improved
Circular Hough Transform
Abstract. Iris segmentation plays the most important role in an iris biometric system, and it determines the subsequent recognition result. There are still many challenges in this research field. This paper proposes a robust iris segmentation algorithm using active contours without edges and an improved circular Hough transform. Firstly, we adopt a simple linear interpolation model to remove the specular reflections. Secondly, we combine HOG features and an Adaboost cascade detector to extract the region of interest from the original iris image. Thirdly, the active contours without edges model and the improved circular Hough transform model are used for the pupillary and limbic boundary localization, respectively. Lastly, two iris databases, CASIA-IrisV1 and CASIA-IrisV4-Lamp, were adopted to evaluate the proposed method. The experimental results show that the performance of the proposed method is effective and robust.
1 Introduction
The rapid development of science and technology brings us much convenience both in life and work; at the same time, however, it also brings more security risks, so a more reliable identity verification system is essential for proving that people are who they claim to be. The traditional authentication systems, based on passwords or ID/swipe cards that can be forgotten, lost, or stolen, are not very reliable. On the other hand, biometrics-based authentication systems determine the identity of individuals by matching physical or behavioral characteristics such as the iris pattern, fingerprint, or voice, and have higher reliability, because people cannot forget or lose their physical or behavioral characteristics, and these characteristics are difficult for others to acquire. As every iris pattern has a highly detailed and unique texture [1] and is stable during an individual's lifetime, iris recognition has become a particularly interesting research field for identity authentication. Iris segmentation, whose goal is to separate the valid region from the original eye image, is the most important step for recognition.
In the recent two decades, researchers in this field have proposed many methods for iris segmentation. As early as 1993, Daugman's work [1] described a feasible iris recognition system in detail, and in 1997 Wildes [2] presented an iris biometric system using a completely different method from that of Daugman. These two classical methods established the basis of later studies. Shamsi et al. [3] improved the integro-differential operator [1] by restricting the space of potential circles to segment the iris region. Wang [4] proposed a surface integro-differential operator to detect the iris boundaries. Li et al. [5] first located the region containing the pupil using the information of specular highlights, then detected the pupillary boundary using Canny edge detection and the Hough transform. [2, 6, 7] are all Hough transform-based segmentation methods.
All the methods mentioned above are circular fitting models. However, the facts are not always as we wish. In most instances the pupil is not a circular area, so there may be more or less error if we try to approximate the boundary with a regular circle. To remedy this difficulty, many researchers have put forward viable ideas. Daugman [8] used an active contours model, based on discrete Fourier series expansions of the contour data, to segment the iris region. Ryan et al. [9] fitted the iris boundaries using ellipses, but for extremely irregular iris boundaries the result is not ideal. Mehrabian et al. [10] detected the pupil boundary using graph cuts theory, and their algorithm only worked on the iris images of CASIA-IrisV1, which were pretreated. Shah and Ross [11] segmented the iris region using geodesic active contours; their work gave a better solution for distorted iris images, however, the model may over-segment at blurred iris boundaries.
In this paper we propose a robust iris segmentation algorithm that locates the iris
boundaries based on active contours without edges and improved circular Hough
transform, Fig. 1 illustrates the flow diagram of the proposed algorithm. The rest of this
paper is organized as follows: Sect. 2 presents preprocessing of the original iris images
through which to reduce the noises and computational burden. Section 3 describes the
location details of pupillary boundary using active contour without edges model. In
Sect. 4, improved circular Hough transform is used for limbic boundary locating.
Experimental results are given in Sect. 5. Conclusions are provided in Sect. 6.
2 Image Preprocessing
segmentation if we do not remove them. As the brightest areas, specular reflections can be easily detected by binarization when we use an adaptive threshold (see Fig. 2b, e). Considering that the reflections are not located on the edge of the image, we adopt a simple linear interpolation model to interpolate them, as shown in Fig. 3.
In processing, we scan the pixel gray values of each line of the binary image in order and record the left and right boundary points $p_{left}$ and $p_{right}$ when a bright region is detected (see Fig. 3). Then we regard the regions $[p_{left}-5,\,p_{left}-1]$ and $[p_{right}+1,\,p_{right}+5]$ of the original image as the left neighborhood and right neighborhood, respectively. Denoting the average gray values of the neighborhoods as $avg_{left}$ and $avg_{right}$, we can compute the pixel gray values in the reflection region of the original image using the weight
$$p = \frac{d_1}{d_1 + d_2} \qquad (1)$$
where $I(x, y)$ is a reflection pixel in the original image, $d_1$ is the distance of $I(x, y)$ to the left boundary and $d_2$ is the distance of $I(x, y)$ to the right boundary. The interpolation result can be seen in Fig. 2c, f.
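The full interpolation formula is not reproduced above, so the sketch below fills each reflection pixel with a weighted blend of the left and right neighborhood averages using the weight p of Eq. (1); the blend I = (1 − p)·avg_left + p·avg_right is an assumption for illustration.

```python
import numpy as np

def remove_reflections(img, mask, nb=5):
    """img: 2-D gray image; mask: boolean array, True where a specular reflection lies."""
    out = img.astype(np.float64).copy()
    for y in range(img.shape[0]):
        x = 0
        while x < img.shape[1]:
            if mask[y, x]:
                left = x
                while x < img.shape[1] and mask[y, x]:
                    x += 1
                right = x - 1
                # Average gray values of the 5-pixel neighborhoods on each side
                avg_l = out[y, max(0, left - nb):left].mean() if left > 0 else out[y].mean()
                avg_r = out[y, right + 1:right + 1 + nb].mean() if right + 1 < img.shape[1] else out[y].mean()
                for xi in range(left, right + 1):
                    d1, d2 = xi - left + 1, right - xi + 1
                    p = d1 / (d1 + d2)                          # Eq. (1)
                    out[y, xi] = (1 - p) * avg_l + p * avg_r    # assumed blend
            else:
                x += 1
    return out.astype(img.dtype)
```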
Fig. 2. Illustration of specular reflection removal. (a), (d) are original iris images. (b), (e) are binary images in which the bright regions correspond to the reflections in (a) and (d), respectively. (c), (f) are the iris images after reflection removal.
Fig. 3. The model of specular reflection interpolation. Each pixel value in the reflection region is determined by the average gray values of the left and right neighborhood regions.
Although individuals' irises are different, they share a common structure (i.e., all irises are ring-like structures). In [12], Friedman et al. proved that boosting is particularly effective when the base models are simple, and in recent years the adaptive boosting (Adaboost) cascade detector has been shown to perform well in detecting well-structured objects, such as faces [13] and hand postures [14]. Here we adopt the Adaboost-cascade detector to extract the region of interest.
In the training stage of the Adaboost-cascade detector, we collected 1,000 positive images and 1,520 negative images as the training set. Each positive image, which contains the complete iris region, is normalized to a size of 32*32, while the negative (non-iris) images receive no further processing. Histogram of oriented gradients (HOG) [15] features are extracted from each training sample (after reflection removal) to serve as the input data of the training system. The feature extraction proceeds as follows:
1. Gradient computation. Here, [–1, 0, 1] and [–1, 0, 1]T are used for computing the horizontal and vertical gradients of the samples, respectively.
2. Orientation binning. Each sample is divided into 8*8-pixel cells, and the gradients in each cell are voted into 9 orientation bins in 0°–180° (“unsigned” gradient). So each block yields 4*9 features. The relation between cell and block is shown in Fig. 4. In addition, half of each block is overlapped by its adjacent blocks, so 3*3*4*9 = 324 features are obtained from each positive sample.
Fig. 4. The relation between block and cell. Each block is consisted of four cells.
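A sketch of HOG extraction with the parameters just described (9 orientation bins, 8×8-pixel cells, blocks of 2×2 cells with 50 % overlap on a 32×32 patch, giving 3·3·4·9 = 324 features per sample) can be written with scikit-image; the library call is an illustration and not necessarily the implementation the authors used.

```python
import numpy as np
from skimage.feature import hog

def iris_hog_features(patch_32x32):
    # 9 bins, 8x8-pixel cells, 2x2-cell blocks with 50% overlap -> 3*3*2*2*9 = 324 features
    return hog(patch_32x32,
               orientations=9,
               pixels_per_cell=(8, 8),
               cells_per_block=(2, 2),
               block_norm='L2-Hys',
               feature_vector=True)

patch = np.random.rand(32, 32)            # stand-in for a normalized positive sample
print(iris_hog_features(patch).shape)     # (324,)
```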
Otherwise, the input image is considered to be a non-iris image. Three images after iris detection are shown in Fig. 5a, b, c.
The advantages of this step are that: (1) non-iris images can be eliminated at the beginning of processing, which reduces the computational burden; (2) the ROI is extracted from the original image, making the subsequent processing more focused.
Fig. 5. Example images after Adaboost detection. (a) – (c) are the detection results, in which the white boxes denote the ROI. (d) is the ROI extracted from an iris image.
and
$$F_1(C) + F_2(C) = \int_{inside(C)}|u_0(x,y)-c_1|^2\,dx\,dy + \int_{outside(C)}|u_0(x,y)-c_2|^2\,dx\,dy \qquad (6)$$
where:
$$c_1 = \frac{1}{N_i}\sum_{inside(C)} u_0(x,y) \qquad (7)$$
$$c_2 = \frac{1}{N_o}\sum_{outside(C)} u_0(x,y) \qquad (8)$$
Here, $N_i$ is the number of pixels inside C and $N_o$ is the number of pixels outside C. From Fig. 6b, c, d, e, we can see that only when the curve C lies on the object boundary does the energy function reach its minimum value. That is,
Fig. 6. (a) is the example image. (b) – (e) illustrate all the possible positions of the curve. (b) $F_1(C)\approx 0$, $F_2(C)>0$, $E>0$. (c) $F_1(C)>0$, $F_2(C)>0$, $E>0$. (d) $F_1(C)>0$, $F_2(C)\approx 0$, $E>0$. (e) $F_1(C)\approx 0$, $F_2(C)\approx 0$, $E\approx 0$.
In the active contours model of Chan and Vese [16], some regularizing terms, such as the length of the curve C and the area of the region inside C, are added to the energy function. The function is as follows:
$$F(c_1,c_2,C) = \mu\,\mathrm{Length}(C) + \nu\,\mathrm{Area}(inside(C)) + \lambda_1\int_{inside(C)}|u_0(x,y)-c_1|^2\,dx\,dy + \lambda_2\int_{outside(C)}|u_0(x,y)-c_2|^2\,dx\,dy \qquad (10)$$
where $\mu \ge 0$, $\nu \ge 0$, $\lambda_1, \lambda_2 > 0$.
Based on the energy function (10), we can obtain the level set formulation by introducing the Heaviside function $H(\cdot)$ and the Dirac function $\delta(\cdot)$. The formulation is given by:
$$F(c_1,c_2,\phi) = \mu\int_{u_0}\delta(\phi(x,y))|\nabla\phi(x,y)|\,dx\,dy + \nu\int_{u_0} H(\phi(x,y))\,dx\,dy + \lambda_1\int_{u_0}|u_0(x,y)-c_1|^2 H(\phi(x,y))\,dx\,dy + \lambda_2\int_{u_0}|u_0(x,y)-c_2|^2\big(1-H(\phi(x,y))\big)\,dx\,dy \qquad (11)$$
$$c_1 = \frac{\int_{u_0} u_0(x,y)\,H(\phi(x,y))\,dx\,dy}{\int_{u_0} H(\phi(x,y))\,dx\,dy} \qquad (12)$$
$$c_2 = \frac{\int_{u_0} u_0(x,y)\,\big(1-H(\phi(x,y))\big)\,dx\,dy}{\int_{u_0}\big(1-H(\phi(x,y))\big)\,dx\,dy} \qquad (13)$$
In order to minimize F, its derivative is calculated and set to zero, as follows:
$$\frac{\partial\phi}{\partial t} = \delta(\phi)\Big[\mu\,\mathrm{div}\Big(\frac{\nabla\phi}{|\nabla\phi|}\Big) - \nu - \lambda_1(u_0-c_1)^2 + \lambda_2(u_0-c_2)^2\Big] = 0 \qquad (14)$$
Fig. 7. Samples of pupillary boundary localization using active contour without edges model.
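For illustration, an off-the-shelf implementation of the active contours without edges model of Eqs. (10)–(14) is available; the sketch below applies scikit-image's Chan–Vese routine to a pupil region, with parameter values chosen arbitrarily rather than taken from the paper.

```python
import numpy as np
from skimage import img_as_float
from skimage.segmentation import chan_vese

def segment_pupil(roi_gray):
    """roi_gray: 2-D grayscale ROI around the pupil (e.g. the Adaboost-detected region)."""
    img = img_as_float(roi_gray)
    # mu weights the curve-length term; lambda1/lambda2 weight the inside/outside fitting terms
    seg = chan_vese(img, mu=0.25, lambda1=1.0, lambda2=1.0, tol=1e-3, dt=0.5,
                    init_level_set="checkerboard")
    return seg        # boolean mask; the pupil is the dark connected region

mask = segment_pupil(np.random.rand(64, 64))   # stand-in input for illustration
```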
As the limbic boundary is not as well defined as the pupillary boundary, and the iris texture near the limbic boundary is not rich, the error is acceptable when we use a regular circle to approximate the limbic boundary while the eye is in a frontal state. In this section, an improved circular Hough transform is used for locating the limbic boundary. The detailed processing is as follows:
1. Resizing. The ROI image is taken as input and resized by a predetermined scale, which greatly decreases the computational burden of the subsequent processing. In the experiments we set the scale to 0.3.
2. Blurring. In order to remove noise that may affect the limbic boundary localization, a Gaussian filter is adopted to blur the output image of step 1. In Fig. 8 we can see that the edge detection result after Gaussian filtering (Fig. 8b) has less noise than the result without it (Fig. 8a).
Fig. 8. Comparison between the results of edge detecting. (a) is the result without Gaussian filter
processing. (b) is the result with Gaussian filter processing.
located in Sect. 3, when searching for the best-fit circle. In this way, we can greatly improve the processing efficiency.
The locating process follows the improved circular Hough transform model. Here we define that the edge points detected by the Canny detector belong to the point set E, the coordinate points inside the pupil region belong to the point set P, and the radius of the fit-circles we search for is limited to the interval $[r_{min}, r_{max}]$. So the fit-circles are limited to a 3D space $(x_0, y_0, r)$, where $(x_0, y_0) \in P$ and $r \in [r_{min}, r_{max}]$. The circular Hough transform model is given by (15) and (16):
$$W_{(x_0,y_0,r)} = \sum_{(x,y)\in E}\ \sum_{\theta=0}^{359} S\Big(\sqrt{\big(x - r\cos(\tfrac{2\pi\theta}{360}) - x_0\big)^2 + \big(y - r\sin(\tfrac{2\pi\theta}{360}) - y_0\big)^2}\ -\ \epsilon\Big) \qquad (15)$$
where S(x) is a voting function: if x < 0 the output is 1, otherwise the output is 0. So if an edge point fits the specified circle model $(x_0, y_0, r)$, 1 is added to the weight of the circle $W_{(x_0,y_0,r)}$. $\epsilon$ is a fixed, predefined small value; in the experiments we set $\epsilon = 3$.
After the voting process for each candidate circle model, we obtain a set of weights of potential circles, shown as (17):
$$Circle_{weight} = \{W_{(x_0,y_0,r)} \mid (x_0,y_0)\in P,\ r\in[r_{min}, r_{max}]\} \qquad (17)$$
The best-fit circle is the one whose weight is the largest in the set $Circle_{weight}$; it is written as (18):
$$Circle_{best\text{-}fit}(x_0,y_0,r) = \max_{(x_0,y_0,r)} Circle_{weight} \qquad (18)$$
Figure 9 shows some process results using improved circular Hough transform.
Fig. 9. Samples of limbic boundary locating results using improved circular Hough transform
model.
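A simplified, vectorized sketch of the restricted voting in Eqs. (15)–(18) is shown below: candidate centers are limited to points inside the already-located pupil, and each edge point votes for a candidate circle when its distance to the center lies within ε of the radius. This distance-based vote is an equivalent shortcut for the 360-step circle sampling of Eq. (15), not the authors' exact implementation.

```python
import numpy as np

def improved_circular_hough(edge_points, pupil_points, r_min, r_max, eps=3):
    """edge_points: (m, 2) array of Canny edge (x, y); pupil_points: (k, 2) candidate centers."""
    best, best_w = None, -1
    radii = np.arange(r_min, r_max + 1)
    for x0, y0 in pupil_points:
        # Distance of every edge point to this candidate center
        d = np.hypot(edge_points[:, 0] - x0, edge_points[:, 1] - y0)
        for r in radii:
            w = int(np.count_nonzero(np.abs(d - r) < eps))   # votes for circle (x0, y0, r)
            if w > best_w:
                best_w, best = w, (int(x0), int(y0), int(r))
    return best, best_w

# Example with synthetic points (illustration only)
theta = np.linspace(0, 2 * np.pi, 200)
edges = np.c_[50 + 30 * np.cos(theta), 50 + 30 * np.sin(theta)]
centers = np.array([[49, 50], [50, 50], [51, 51]])
print(improved_circular_hough(edges, centers, 25, 35))
```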
5 Experimental Results
The database CASIA-IrisV4-Lamp contains many iris samples with long eyelashes. In order to improve the performance, a 1D rank filter was adopted to process the eyelashes before segmentation. The experimental results for database CASIA-IrisV4-Lamp are shown in Table 2.
Some samples of segmentation results from database CASIA-IrisV4-Lamp are shown in Fig. 11.
The above two groups of experimental results demonstrate the effectiveness and
robustness of the proposed method.
6 Conclusions
Iris segmentation is a critical module in an iris recognition system, and its accuracy seriously affects the recognition results. In this paper we propose a robust iris segmentation method using active contours without edges and an improved circular Hough transform. Firstly, we use a simple linear interpolation method to remove specular reflections and combine HOG features with an Adaboost-cascade detector to detect the region of interest. Secondly, the active contours without edges model is adopted to locate the pupillary boundary. Finally, the improved circular Hough transform is used for limbic boundary localization. Experimental results show that the performance of the proposed method is better than that of Daugman [18] and Wildes [2].
This paper does not cover eyelid and eyelash processing, which is also an important part of iris segmentation. In future work, we will devote more time to removing the interference from eyelids and eyelashes and to extracting the most effective iris region for subsequent matching.
Acknowledgement. The authors wish to thank the Chinese Academy of Sciences' Institute of Automation (CASIA) for providing the CASIA iris image databases.
References
1. Daugman, J.G.: High confidence visual recognition of persons by a test of statistical
independence. IEEE Trans. Pattern Anal. Mach. Intell. 15, 1148–1161 (1993)
2. Wildes, R.P.: Iris recognition: an emerging biometric technology. Proc. IEEE 85, 1348–1363
(1997)
468 Y. Ren et al.
3. Shamsi, M., Saad, P.B., Ibrahim, S.B., Kenari, A.R.: Fast algorithm for iris localization
using daugman circular integro differential operator. In: International Conference of Soft
Computing and Pattern Recognition, SOCPAR 2009, pp. 393–398 (2009)
4. Chunping, W.: Research on iris image recognition algorithm based on improved differential
operator. J. Convergence Inf. Technol. 8, 563–570 (2013)
5. Peihua, L., Xiaomin, L.: An incremental method for accurate iris segmentation. In: 19th
International Conference on Pattern Recognition, 2008, ICPR 2008, pp. 1–4 (2008)
6. Bendale, A., Nigam, A., Prakash, S., Gupta, P.: Iris segmentation using improved hough
transform. In: Huang, D.-S., Gupta, P., Zhang, X., Premaratne, P. (eds.) ICIC 2012. CCIS,
vol. 304, pp. 408–415. Springer, Heidelberg (2012)
7. Mahlouji, M., Noruzi, A., Kashan, I.: Human iris segmentation for iris recognition in
unconstrained environments. IJCSI Int. J. Comput. Sci. 9, 149–155 (2012)
8. Daugman, J.: New methods in iris recognition. IEEE Trans. Syst. Man Cybern. Part B
Cybern. 37, 1167–1175 (2007)
9. Ryan, W.J., Woodard, D.L., Duchowski, A.T., Birchfield, S.T.: Adapting starburst for
elliptical iris segmentation. In: 2nd IEEE International Conference on Biometrics: Theory,
Applications and Systems, BTAS 2008, pp. 1–7 (2008)
10. Mehrabian, H., Hashemi-Tari, P.: Pupil boundary detection for iris recognition using graph
cuts. In: Proceedings of Image and Vision Computing New Zealand 2007, pp. 77–82 (2007)
11. Shah, S., Ross, A.: Iris segmentation using geodesic active contours. IEEE Trans. Inf.
Forensics Secur. 4, 824–836 (2009)
12. Friedman, J., Hastie, T., Tibshirani, R.: Additive logistic regression: a statistical view of
boosting (With discussion and a rejoinder by the authors). Ann. Stat. 28, 337–407 (2000)
13. Viola, P., Jones, M.: Rapid object detection using a boosted cascade of simple features. In:
Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and
Pattern Recognition, CVPR 2001, vol.1, pp. I-511–I-518 (2001)
14. Yao, Y., Li, C.-T.: Hand posture recognition using SURF with adaptive boosting. In:
Presented at the British Machine Vision Conference (BMVC), Guildford, United Kingdom
(2012)
15. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2005,
vol. 1, pp. 886–893 (2005)
16. Chan, T.F., Vese, L.A.: Active contours without edges. IEEE Trans. Image Process. 10,
266–277 (2001)
17. CASIA iris image database: http://biometrics.idealtest.org/
18. Daugman, J.: How iris recognition works. In: Proceedings of 2002 International Conference
on Image Processing, vol. 1, pp. I-33–I-36 (2002)
An Adaptive Hybrid PSO and GSA Algorithm
for Association Rules Mining
1 Introduction
The main goal of mining association rules is to find hidden relationships between items in a given transactional database. Since it was proposed, association rule mining has attracted universal attention from scholars and been deeply studied. It has been successfully applied to many different fields, such as shopping cart analysis [1], stock market analysis, network attack analysis [2], and so on.
Many algorithms have been proposed for association rules, such as FP-tree [3], the classic Apriori algorithm [4, 5], and other algorithms based on frequent itemsets. Current datasets are characterized by big volumes of data. The classical association rule mining algorithms dealt with data sets in a reasonably efficient way and in reasonable time; however, they are difficult to apply to the current amounts of data. In recent years, research on intelligent heuristic algorithms has made great progress. These intelligent algorithms have been used to solve non-convex and non-linear optimization problems. Hence, people are trying to use new intelligent algorithms to mine association rules, such as the genetic algorithm [6], particle swarm optimization [7], the ant colony algorithm [8], some hybrid algorithms [9, 10], and so on. In [9], the authors propose the SARIC algorithm, which uses set particle swarm optimization to generate association rules from a database and considers both positive and negative occurrences of attributes; the experiments verify the efficiency of SARIC. In [10], a novel hybrid algorithm called HPSO-TS-ARM is proposed for association rule mining, based on three well-known high-level procedures: particle swarm optimization, tabu search, and the Apriori algorithm. Particle swarm optimization (PSO) is an evolutionary algorithm developed in recent years. It also starts from random solutions, searches for the optimal solution iteratively, and evaluates the quality of solutions through fitness, but it is simpler than the genetic algorithm. This algorithm has attracted the attention of academia and has demonstrated its superiority in solving practical problems thanks to its easy implementation, high accuracy, and fast convergence. PSO has fast convergence, but it easily falls into local optima, especially when mining association rules.
A heuristic optimization method called the Gravitational Search Algorithm (GSA), based on the law of gravity and mass interactions, is proposed in [11]. It has been shown that GSA searches well for the global optimum, but its search speed becomes slow in the last iterations. When intelligent algorithms are used to solve problems, some parameters need to be set in advance, such as the inertia and acceleration coefficients in PSO or the mutation and selection probabilities in GA. Research has shown that these parameters have a great influence on the performance of the algorithms on various problems. For example, van den Bergh et al. study the influence of parameters such as the inertia term on the performance of PSO [12]. In the past few years, many approaches have been proposed for adapting suitable parameters. In [13], the acceleration coefficients of PSO are varied adaptively during the iterations to improve the solution quality of the original PSO and to avoid premature convergence; the experiments show that this self-adaptive scheme balances exploration and exploitation well. In [14], an adaptive particle swarm optimization (APSO) that uses the population distribution information of the evolutionary state to adaptively control the acceleration coefficients is presented, and the results show that APSO substantially improves the performance of PSO. Motivated by the idea of hybridizing the two algorithms and by the adaptive control of the acceleration coefficients, in this paper we propose a new algorithm called A_PSOGSA for association rule mining.
The rest of this paper is organized as follows. Section 2 introduces the related theory of association rules, PSO, and GSA. Section 3 presents the new algorithm. Section 4 summarizes our experimental results, demonstrating the good performance of A_PSOGSA and comparing it with other ARM algorithms. Section 5 highlights the contribution of this work and concludes the paper with some remarks.
2 Related Theory
This section mainly introduces the basic concept of association rules and the related
theory of PSO and GSA.
$$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (2)$$
where w is the inertia weight, c1 and c2 are two constants that are usually set to 2, and r1 and r2 are random numbers between 0 and 1. As shown in Eq. (1), the first part provides exploration ability for PSO, the second part represents individual cognition, which leads a particle toward the best position it has experienced, and the third part represents social cognition, which leads particles toward the best position found by the swarm.
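For reference, the velocity update that the paragraph above describes (Eq. (1), which did not survive extraction) has the standard PSO form; a reconstruction consistent with the stated symbols is

$$v_i(t+1) = w\,v_i(t) + c_1 r_1\,(pbest_i - x_i(t)) + c_2 r_2\,(gbest - x_i(t)) \qquad (1)$$

where $pbest_i$ is the best position particle i has experienced and $gbest$ is the best position found by the swarm; these two symbol names are ours.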
PSO has the advantage of fast convergence, but it easily falls into local optima, especially when dealing with huge amounts of data. Particles update their positions by tracking only two extreme values, without considering the interaction between particles. If the mutual influence between particles were taken into account, the probability of falling into a local optimum would be reduced and efficiency would be improved.
where G(t) is the gravitational coefficient at time t and $R_{ij}(t)$ is the Euclidean distance between the two agents; $\alpha$ is the descending coefficient, $G_0$ is the initial gravitational constant, iter is the current iteration, and maxiter is the maximum number of iterations.
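The formulas referred to here were lost in extraction; the standard GSA definitions from [11], consistent with the symbols above, are

$$G(t) = G_0\, e^{-\alpha\, iter / maxiter}, \qquad R_{ij}(t) = \lVert x_i(t) - x_j(t) \rVert_2$$

(the original equation numbers are not recoverable, so none are assigned here).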
In the search space, the total gravitational force of agent i is represented by the
following formula:
$$F_i(t) = \sum_{j=1,\, j \neq i}^{n} rand_j\, F_{ij}(t) \qquad (6)$$

$$a_i(t) = \frac{F_i(t)}{M_{ii}(t)} \qquad (7)$$
$$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (9)$$
The masses of all agents are updated using the following equations, in which worst and best depend on the type of problem studied, i.e., whether it is a minimization or a maximization problem.
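The mass-update equations themselves were not recovered; the standard GSA forms from [11], matching the worst/best remark above, are

$$m_i(t) = \frac{fit_i(t) - worst(t)}{best(t) - worst(t)}, \qquad M_i(t) = \frac{m_i(t)}{\sum_{j=1}^{n} m_j(t)}$$

where $fit_i(t)$ is the fitness of agent i at iteration t (our notation).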
GSA considers the interaction between particles, but it has no memory, so the agents cannot keep track of the global optimum found so far. Its main drawback is slow convergence.
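The combined velocity update of the hybrid algorithm (presumably Eq. (11), which precedes Eq. (12) but was not recovered) is described in the paragraph following Eq. (12); a reconstruction consistent with that description, following the usual PSOGSA formulation, is

$$v_i(t+1) = w\,v_i(t) + c_1 r_1\, a_i(t) + c_2 r_2\,(gbest - x_i(t))$$

where $a_i(t)$ is the GSA acceleration from Eq. (7) and $gbest$ is the best position found so far by the swarm (our notation).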
$$x_i(t+1) = x_i(t) + v_i(t+1) \qquad (12)$$
where the parameters $w, c_1, c_2, r_1, r_2$ have the same meaning as in PSO, and $a_i(t)$ is the acceleration of particle i at iteration t, calculated by GSA. In the combined algorithm, the velocity update depends not only on the local and global optima but also on the acceleration of the particles computed by GSA. That is to say, in the new update rule the particles are affected by the force exerted by all other particles, which makes the algorithm more efficient when searching the problem space, as can be seen from the following aspects: the fitness of the particles is considered in the update of each iteration; particles near the optimal value are able to attract particles that attempt to explore other parts of the search space; the algorithm has memory and saves the global optimum found so far, so the best solution can be obtained at any time, which is very useful for association rule mining; and any particle can perceive the global optimum and move closer to it.
When the proposed algorithm is used to mine association rules, some specific adaptations must be made to the particles.
The encoding of solutions: in the coding scheme used in this paper, one particle represents one rule. Every particle S is a vector of n + 1 elements (n is the number of attributes or items), where S[1] is the index separating the consequent and antecedent parts, and S[2]..S[n + 1] give the positions of the items, each consisting of two parts: the attribute and the data point's linguistic value. An item position may take the value 0, indicating that the attribute at this position does not appear in the rule.
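To make the encoding concrete, the following sketch decodes a particle vector into antecedent and consequent item sets. The function name, the 0-based layout, and the assumption that items left of the separator form the antecedent are ours; the paper does not give code.

# Hypothetical sketch of the rule encoding described above.
def decode_particle(S, n):
    """S: list of length n + 1. S[0] is the separator index k; S[1:] holds one
    entry per attribute, where 0 means the attribute is absent and any other
    value is the attribute's linguistic (interval) label. Items at positions
    below k are placed in the antecedent, the rest in the consequent."""
    k = S[0]
    antecedent, consequent = [], []
    for attr in range(n):
        label = S[1 + attr]
        if label == 0:          # attribute not used in this rule
            continue
        (antecedent if attr < k else consequent).append((attr, label))
    return antecedent, consequent

# Example: 5 attributes, separator at 3, attributes 1 and 4 used.
print(decode_particle([3, 0, 2, 0, 0, 1], 5))   # ([(1, 2)], [(4, 1)])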
The fitness function: the fitness function links the algorithm to the association rules, and a well-designed fitness function helps the algorithm mine better rules. The minimum support and the minimum confidence are the two thresholds of a rule. The fitness of a rule S is defined in terms of its support and confidence, where w is used to increase the number of meaningful rules (in this paper w = 2) and |D| is the total number of records in the transaction database.
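The exact fitness formula could not be recovered from the text above; purely as an illustration, the sketch below computes the support and confidence of a decoded rule over a transaction database and combines them in one commonly used weighted form. The combination w * support * confidence is an assumption, not the paper's definition.

def support_confidence(rule, transactions):
    """rule: (antecedent, consequent) as sets of items;
    transactions: list of item sets (|D| = len(transactions))."""
    antecedent, consequent = rule
    both = antecedent | consequent
    n_ante = sum(1 for t in transactions if antecedent <= t)
    n_both = sum(1 for t in transactions if both <= t)
    support = n_both / len(transactions)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence

def fitness(rule, transactions, w=2):
    sup, conf = support_confidence(rule, transactions)
    return w * sup * conf          # assumed combination, for illustration only

D = [{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}]
print(round(fitness(({"a"}, {"c"}), D), 3))    # 0.667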
$$c_1 = c_{1i} + \frac{c_{1f} - c_{1i}}{maxiter} \times iter \qquad (15)$$
where $c_{1i}$ and $c_{1f}$ are the initial and final values of the cognitive and social acceleration coefficients, respectively.
To verify the feasibility, accuracy, and efficiency of the A_PSOGSA algorithm for mining association rules, the algorithm was implemented in MATLAB 2009 and run on an ordinary Lenovo computer clocked at 1.78 GHz with 3 GB of memory. The test data sets come from the UCI public repository: Car Evaluation, Nursery, Page Block, and Balance. The download address is http://archive.ics.uci.edu/ml/datasets.html. The features of the datasets are described in Table 1.
For the A_PSOGSA algorithm, c1 and c2 are initialized to 2 and adaptively controlled according to the evolutionary state; w decreases linearly from 0.9 to 0.4; α and G0 in GSA are set to 20 and 1, respectively. The initial population size is 100, and the maximum number of iterations is 300.
[Two figures (captions not recovered): convergence curves of the fitness function, showing the worst, best, and mean fitness over 40 iterations.]
Fig. 3. The mining result comparison (number of rules versus iterations) of A_PSOGSA, PSO, GA and APSO on Page Block
Table 2. The results of association rules mining of A_PSOGSA, PSO, GA and APSO
Data set  Algorithm  Support  Confidence  Number of rules
Page Block A_PSOGSA 0.68 0.93 500
PSO 0.71 0.86 293
GA 0.65 0.85 360
APSO 0.70 0.90 402
Car Eva A_PSOGSA 0.69 0.71 26
PSO 0.64 0.68 21
GA 0.60 0.70 24
APSO 0.66 0.72 23
Nursery A_PSOGSA 0.56 0.82 41
PSO 0.53 0.80 41
GA 0.54 0.79 38
APSO 0.56 0.80 40
Balance A_PSOGSA 0.38 0.75 36
PSO 0.38 0.76 32
GA 0.35 0.73 30
APSO 0.37 0.76 32
5 Conclusion
In this paper, a novel algorithm called A_PSOGSA is proposed for mining association rules. It integrates PSO and GSA, exploiting the high convergence speed of PSO and the global search ability of GSA, and adaptively controls the acceleration coefficients using the population distribution information, which provides a better balance between global exploration and local exploitation. The experiments demonstrate the feasibility of the presented algorithm for mining association rules, and the mining results show that the performance of A_PSOGSA is improved compared with the PSO and GA algorithms. In summary, the algorithm is well suited to mining association rules and shows good performance.
References
1. Cao, L., Zhao, Y., Zhang, H., et al.: Flexible frameworks for actionable knowledge
discovery. J. Knowl. Data Eng. 22(9), 1299–1312 (2010)
2. Lan, G.C., Hong, T.P., Tseng, V.S.: A projection-based approach for discovering high
average-utility itemsets. J. Inf. Sci. Eng. 28(1), 193–209 (2012)
3. Han, J., Pei, J., Yin, Y.: Mining frequent patterns without candidate generation. In:
SIGMOD 2000, Proceedings of the 2000 ACM SIGMOD International Conference on
Management of Data, pp. 1–12. ACM, New York (2000)
4. Agrawal, R., Imieliński, T., Swami, A.: Mining association rules between sets of items in
large databases. In: SIGMOD 1993, Proceedings of the 1993 ACM SIGMOD International
Conference on Management of Data, pp. 207–216. ACM, New York (1993)
5. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules. In: Proceedings of the
20th International Conference on Very Large Data Bases, VLDB, pp. 487–499 (1994)
6. Minaei-Bidgoli, B., Barmaki, R., Nasiri, M.: Mining numerical association rules via
multi-objective genetic algorithms. J. Inf. Sci. 233, 15–24 (2013)
7. Beiranvand, V., Mobasher-Kashani, M., Bakar, A.A.: Multi-objective PSO algorithm for
mining numerical association rules without a priori discretization. J. Expert Syst. Appl. 41
(9), 4259–4273 (2014)
8. Sundaramoorthy, S., Shantharajah, S.P.: An improved ant colony algorithm for effective
mining of frequent items. J. Web Eng. 13(3–4), 263–276 (2014)
9. Agrawal, J., Agrawal, S., Singhai, A., et al.: SET-PSO-based approach for mining positive
and negative association rules. J. Knowl. Inf. Syst., 1–19 (2014)
10. Kaur, S., Goyal, M.: Fast and robust hybrid particle swarm optimization tabu search
association rule mining (HPSO-ARM) algorithm for web data association rule mining
(WDARM). J. Adv. Res. Comput. Sci. Manag. Stud. 2, 448–451 (2014)
11. Rashedi, E., Nezamabadi-Pour, H., Saryazdi, S.: GSA: a gravitational search algorithm.
J. Inf. Sci. 179(13), 2232–2248 (2009)
12. Van den Bergh, F., Engelbrecht, A.P.: A study of particle swarm optimization particle
trajectories. J. Inf. Sci. 176(8), 937–971 (2006)
13. Mohammadi-Ivatloo, B., Moradi-Dalvand, M., Rabiee, A.: Combined heat and power
economic dispatch problem solution using particle swarm optimization with time varying
acceleration coefficients. J. Electr. Power Syst. Res. 95, 9–18 (2013)
14. Zhan, Z.H., Zhang, J., Li, Y., et al.: Adaptive particle swarm optimization. J. Syst. Man
Cybern. 39(6), 1362–1381 (2009)
15. Sarath, K., Ravi, V.: Association rule mining using binary particle swarm optimization.
J. Eng. Appl. Artif. Intell. 26(8), 1832–1840 (2013)
Sequential Pattern Mining and Matching
Method with Its Application on Earthquakes
1 Introduction
which is quite different from Apriori-based algorithms and has proved to be much more efficient.
In general, time series data is high-dimensional, so the choice of methods for representing sequential patterns [5] is of great importance. The frequency-domain representation maps a time series into the frequency domain using the Discrete Fourier Transform (DFT), while Singular Value Decomposition (SVD) [6] represents the whole time series database through dimensionality reduction. Symbolic representation [7] maps a time series discretely into a character string.
Studies on the emission of chemical gases before earthquakes, such as carbon monoxide (CO) and methane (CH4), have received great attention. Through the analysis of large-area CO gas escaping from the Qinghai-Tibet Plateau on April 30, 2000, the Earth Observation System (EOS) revealed an anomalous layer structure in areas with abnormally high CO content [8]. Documented cases show that abnormal phenomena before earthquakes exist objectively and result from increased emissions of greenhouse gases. Among the 18 attributes of the EOS-AQUA satellite data, a large number of experiments show that mining abnormal sequences of CO content yields relatively good results. Therefore, the experiments in this paper are based on the analysis of CO content.
The rest of this paper is organized as follows. Section 2 introduces some related definitions. Section 3 presents the abnormality-detection method based on sequence mining. The analysis of the experimental results is provided in Sect. 4. Finally, the summary of this paper and future work are discussed in Sect. 5.
2 Related Definitions
Definition 4 (Sequence class): Sequences that are partly similar to each other are grouped into a set called a sequence class. For example, Fig. 1 shows the result of mining 10 seismic data sequential patterns, i.e., a set of 10 sequential patterns. This sequence class is represented as <S1, S2, S3, S4, S5, S6, S7, S8, S9, S10>, where Si stands for a mined frequent sequence of the data after symbolization.
Fig. 1. The mined sequence class of 10 sequential patterns (S1–S10):
S1: a c c c c d
S2: a a c c c c d d
S3: c c c c c d
S4: a c c c c c d d
S5: a a c c c d d e
S6: a a c c c c c d d e d
S7: a a c c c c d
S8: c c c d d
S9: a c c c c c d d e
S10: a c d d e d
Definition 5 (Sequence focus): The sequence with the highest inclusive degree among all the sequences in a sequence class is called the sequence focus. Here the inclusive degree of Si is defined as the ratio reflecting the degree to which the sequence Si contains the other sequences of the same sequence class. Taking the sequence class in Fig. 1 as an example, the sequence with the highest inclusive degree (100 % here) is {a a c c c c c d d e d}, so we regard this sequence as the sequence focus of the class.
Definition 6 (Difference set of sequential patterns): Since seismic precursory data may also contain non-seismic factors, we mine frequent sequences from both seismic and non-seismic data. The difference set of sequential patterns is then generated by subtracting the non-seismic sequence set from the seismic sequence set. That is, if a sequence from the frequent seismic sequence set also occurs in the frequent non-seismic sequence set, its support is reduced accordingly, and the sequence is kept or discarded depending on whether the reduced support is still no less than the initial minimum support.
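A small sketch of this difference-set construction; the dictionary layout (pattern tuple mapped to its support) is our assumption:

# Difference set of sequential patterns: a seismic pattern survives only if
# its support, reduced by the support it also has in the non-seismic set,
# stays at or above min_sup.
def difference_set(seismic, non_seismic, min_sup):
    result = {}
    for pattern, sup in seismic.items():
        remaining = sup - non_seismic.get(pattern, 0.0)
        if remaining >= min_sup:
            result[pattern] = remaining
    return result

seismic = {("a", "c", "c", "d"): 0.8, ("c", "d", "d"): 0.5}
non_seismic = {("c", "d", "d"): 0.3}
print(difference_set(seismic, non_seismic, min_sup=0.4))
# {('a', 'c', 'c', 'd'): 0.8}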
Step 2. Once the frequent sequential patterns have been generated, the specific sequential pattern before the earthquake is identified. Moreover, the sequence focuses meeting the defined conditions are located within each sequence class, after which the sets of sequence focuses are formed and the matching algorithm is built.
Step 3. With the improved pre-earthquake matching algorithm, the accuracy rate, the missing report rate, and the false positive rate are computed to confirm the validity of the method.
Algorithm PrefixSpan
Input: Sequence database S and the minimum support min_sup
Output: The complete set of sequential patterns
1 Read in the sequence database S and the minimum support threshold min_sup.
2 Set the sequence length K = 1 for the first round, and find the frequent sequences L of length K in the projected database, where a frequent sequence is one whose support is no less than min_sup.
3 Divide the search space by L and mine the frequent sequences of length K + 1 that share the corresponding prefixes. If the mining result is empty, go to step 5.
4 Increase K to K + 1, assign the set L found in step 3 to S, and go to step 2.
5 Record and output all the mined frequent sequences.
In addition, as a depth-first search algorithm, PrefixSpan recursively maps the data into smaller projected databases. Because it does not need to generate candidate sequential patterns, both the search space and the size of the projected database shrink, so the mining efficiency is greatly enhanced.
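A minimal Python sketch of PrefixSpan for sequences of single items, which matches the symbolized data used in this paper (e.g. "a a c c d"); variable names and the absolute support threshold are ours:

# Simplified PrefixSpan: recursively grows a prefix and projects the database
# onto the suffixes following that prefix, so no candidate generation is needed.
def prefixspan(db, min_sup):
    results = []

    def mine(prefix, projected):
        counts = {}
        for seq in projected:               # count items that can extend prefix
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item, sup in counts.items():
            if sup < min_sup:
                continue
            new_prefix = prefix + [item]
            results.append((tuple(new_prefix), sup))
            # Project: keep the suffix after the first occurrence of item.
            new_projected = [seq[seq.index(item) + 1:] for seq in projected
                             if item in seq]
            mine(new_prefix, new_projected)

    mine([], db)
    return results

db = [list("accd"), list("aaccdd"), list("cccd"), list("acdde")]
print(prefixspan(db, min_sup=3))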
where α represents a time series such as <S1, S2, S3, …, Sn>, and F is the set of all sequence focuses, namely {F1, F2, F3, …, Fi}. The function LCS(α, Fi) returns the longest common subsequence of the sequence α and the sequence focus Fi. If the longest common subsequence is empty, i.e., isempty(LCS(α, Fi)) = 1, the match fails and the matching function is set to 0; otherwise it is set to 1.
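The LCS(α, Fi) routine is not spelled out in the paper; the standard dynamic-programming longest common subsequence, which returns the subsequence itself so that an empty result signals a failed match, looks as follows:

def lcs(a, b):
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    out, i, j = [], m, n                 # backtrack to recover one LCS
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

def match_fun(alpha, focus):
    return 0 if not lcs(alpha, focus) else 1

print("".join(lcs(list("aaccccd"), list("accccdde"))))   # accccd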
The factors that influence the matching algorithm include the precursor time, the precursor area, the sequence support, and the data segmentation. Once these parameters are set, the matching degree can be computed by formula (2).
$$f\_deg(\alpha) = \sum_{i=0}^{n} match\_fun(F_i) \div \sum_{i=0}^{n} F_i \qquad (2)$$
Here α and Fi play the same roles as in formula (1). A large number of experiments show that when the matching degree lies in [0.4, 0.7], the prediction results tend to be better.
$$f\_valid(F_i) = \begin{cases} 1, & f\_deg(\alpha) \ge supRatio \\ 0, & f\_deg(\alpha) < supRatio \end{cases} \qquad (3)$$
It is indicated in Formula (3) that when the matching degree is no less than the
defined support, the data is valid, namely, f_valid(Fi) = 1.
$$match\_num(F) = \sum_{i=0}^{n} f\_valid(F_i) \qquad (4)$$
Formula (4) calculates the number of test cases satisfying the condition, from which both the accuracy rate and the missing report rate are computed.
The core idea of the FreSeqMatching algorithm is first to verify the seismic test set against the frequent sequence set and compute the sequence matching degrees; then the seismic test data and the non-seismic test data are matched against the mined frequent sequence sets.
Algorithm PreSeqMatching
Input: Frequent sequence set freqSeq, quake model data quakeModel, quake test data quakeTest, normal test data normalTest
Output: Accuracy rate matchRatio and missing report rate falseRatio
1 Read in the frequent sequence set and the data sets quakeModel, quakeTest, normalTest.
2 Initialize and set the support supRatio.
3 Call function GetLFreq() to simplify the freqSeq set.
4 Work out the model matching ratio of the quakeModel set by modelMatchRatio = MatchingDegree(freqSeq, quakeModel).
5 If modelMatchRatio meets the conditions, go to step 7.
6 Otherwise, reset the support supRatio and go to step 4.
7 For the quakeTest set, calculate matchRatio = MatchingDegree(freqSeq, quakeTest).
8 For the normalTest set, calculate falseRatio = MatchingDegree(freqSeq, normalTest).
Analysis:
(1) Steps 1 and 2 perform initialization. Step 3 simplifies the frequent sequence sets using the GetLFreq function. Steps 4 to 6 perform backward verification with the modeling data. Step 7 calculates the prediction accuracy rate, and the false positive rate is worked out in step 8.
(2) The GetLFreq function above is used for simplifying frequent sequence sets.
Function GetLFreq
Input: Frequent sequence set freqSeq
Output: Simplified frequent sequences
1 Read in the number m of sequences in freqSeq.
2 Initialize the min_sup of the frequent sequences.
3 For the m frequent sequences, delete those with support less than min_sup and update m.
4 For the m frequent sequences, compute the longest common sequence between every two sequences by comSeq = LCS(freqSeq[i], freqSeq[j]); when all pairs are processed, go to step 8.
5 If comSeq is only a partial intersection of the two sequences or is empty, return to step 4.
6 If comSeq = freqSeq[i], mark freqSeq[i] with flag and return to step 4.
7 If comSeq = freqSeq[j], mark freqSeq[j] with flag and return to step 4 as well.
8 Delete the sequences marked with flag in steps 4 to 7 to obtain the simplified frequent sequences.
Function MatchingDegree
Input: Frequent sequence set freqSeq and quake test data quakeTest
Output: The accuracy rate matchRatio
1 For each quakeTest[i] and each simplified freqSeq[j], find the longest common sequence by LCS(quakeTest[i], freqSeq[j]).
2 Calculate the value of match_fun(freqSeq[j]) through formula (1) above.
3 For each earthquake case, figure out its matching degree based on formula (2): f_deg(α) = Σ_{i=0}^{n} match_fun(freqSeq[i]) ÷ Σ_{i=0}^{n} freqSeq[i].
20.5°N to 48.5°N. Owing to the lack of enough earthquake cases, a precursor area of 2°×2° is used to obtain more earthquake samples in the experiment.
The main steps about data preprocessing are as follows.
(1) Data interpolation: in the original remote sensing data, missing values are represented by −9999, and a certain amount of data is missing. Data recovery, i.e., data interpolation, is therefore necessary. In this experiment, linear interpolation is applied to replace the missing data appropriately (a small numpy sketch of steps (1) and (2) follows this list).
(2) Data normalization: to remove the influence of regional factors, the remote sensing data are normalized. In view of seasonal factors, the normalization is done per month: the mean values of all historical data without earthquakes are computed for each month, and each value is divided by this average, producing percentage values around 1. This reflects the change trend of the data during the precursor time more effectively.
(3) Data segmentation: to represent the change trend of the data effectively, a linear segmentation method is applied to the normalized data to turn it into a character representation, which makes sequential pattern mining more convenient. To obtain better prediction results, different numbers of segments, such as 5, 7, and 10, are used in the experiments.
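The sketch mentioned in step (1), assuming the data of one grid cell is a 1-D array in which −9999 marks missing values and that monthly means over non-seismic years are available; the array and dictionary names are ours:

import numpy as np

# Step (1): linear interpolation over missing values coded as -9999.
def interpolate_missing(values, missing=-9999):
    values = np.asarray(values, dtype=float)
    bad = values == missing
    idx = np.arange(len(values))
    values[bad] = np.interp(idx[bad], idx[~bad], values[~bad])
    return values

# Step (2): per-month normalization by the non-seismic historical mean,
# giving percentage values that fluctuate around 1.
def normalize_monthly(values, months, monthly_mean):
    return np.array([v / monthly_mean[m] for v, m in zip(values, months)])

print(interpolate_missing([1.2, -9999, 1.4, 1.5, -9999, 1.3]))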
As shown in Table 1, we conducted 72 experiments to find the better precursor time, precursor area, sequence support, and number of data segments.
Analysis of Results. The prediction rates used in the results are computed as follows.
(1) SeismicData_CorrectRate, which is short for the correct rate of applying seismic
data to predict earthquakes.
$$SeismicData\_CorrectRate = \frac{Tnum(SeismicDataTest\_True)}{Tnum(SeismicDataTest\_All)} \qquad (5)$$
(3) NormalData_FalseRate, which represents the false rate of using the normal data to
predict earthquakes in this experiment
$$NormalData\_FalseRate = \frac{Tnum(NormalDataTest\_True)}{Tnum(NormalDataTest\_All)} \qquad (7)$$
15 %. For comparison, the results are shown in Fig. 4, with the X-axis giving the number of the earthquake case and the Y-axis the sequence matching support.
To explain Fig. 4 clearly: the sequence matching degree, produced by the matching algorithm from the frequent patterns obtained by the sequential pattern mining algorithm, reflects the similarity between a test case and the mined earthquake frequent patterns. For the seismic test data, it can be seen from Fig. 4 that when the matching support is set to 0.5, the matching degree of case No. 1 is 0.6, greater than 0.5, so it is predicted to be seismic data, whereas the matching degree of case No. 2 is 0.4, less than 0.5, so it is regarded as non-seismic data. As for the non-seismic test data, case No. 3 is classified as non-seismic, with a matching degree of 0.24, clearly less than 0.5, while case No. 6, with a matching degree of 0.63, greater than 0.5, is forecasted to be seismic data.
Hereby, there exist 13 cases of data with matching degree no less than 0.5 and 7
opposite cases among 20 cases of seismic data. Therefore, the accuracy rate is figured
out to be 65 % based on Formula (5), with the missing report rate of 35 % on the basis
of Formula (6). Besides, in 20 cases of non-seismic data, the number of cases with no
less than 0.5 matching degree is 3, and the opposite is 17. Here comes the conclusive
result that the false positive rate of prediction is 15 % in accordance with Formula (7).
5 Conclusions
regular patterns of remote sensing data from a new point of view. As a consequence, effective abnormal patterns hidden in the historical data are mined to realize prediction through pattern matching.
The pre-earthquake prediction based on sequential pattern matching still has several aspects that can be improved:
(1) If a better interpolation method is used to replace the invalid data, the actual missing values could be reflected more precisely, which would make the mined sequential patterns more accurate to a certain extent.
(2) If the time factor were incorporated into the discovered sequential patterns, real-time prediction would gain more practical application value.
References
1. Alvan, H.V., Azad, F.H., Omar, H.B.: Chlorophyll concentration and surface temperature
changes associated with earthquakes. Nat. Hazards 64(1), 691–706 (2012)
2. Dong, X., Pi, D.C.: Novel method for hurricane trajectory prediction based on data mining.
Natural Hazards Earth Syst. Sci. 13, 3211–3220 (2013)
3. Zaki, M.J.: SPADE: an efficient algorithm for mining frequent sequences. Mach. Learn.
42(1–2), 31–60 (2001)
4. Pei, J., Han, J., Mortazavi-Asl, B., et al.: PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the 17th International Conference on Data Engineering (ICDE 2001), pp. 215–224. IEEE Computer Society (2001)
5. Bettaiah, V., Ranganath, H.S.: An analysis of time series representation methods: data mining
applications perspective. In: Proceedings of the 2014 ACM Southeast Regional Conference,
p. 16. ACM (2014)
6. Tong, X.H., Ye, Z., Xu, Y.S., et al.: A novel subpixel phase correlation method using singular
value decomposition and unified random sample consensus. IEEE Trans. Geosci. Remote
Sens. 53, 4143–4156 (2015)
7. Baydogan, M.G., Runger, G.: Learning a symbolic representation for multivariate time series
classification. Data Min. Knowl. Discov. 29, 1–23 (2014)
8. Yao, Q., Qiang, Z., Wang, Y.: CO release from the Tibetan plateau before earthquakes and increasing temperature anomaly showing in thermal infrared images of satellite. Adv. Earth Sci. 20(5), 505–510 (2005)
Teaching Quality Assessment Model Based
on Analytic Hierarchy Process
and LVQ Neural Network
1 Introduction
The quality of higher education has received increasingly more attention nowadays,
and accordingly universities and colleges are trying to improve the teaching quality
through various means. In the area of teaching quality assessment, manual assessment
method used to be the predominant way to evaluate teachers’ effectiveness in class-
room teaching. However due to its innate limitations such as high subjectivity and low
precision, gradually it has been almost out of use. Then an assessment method based on
an expert system was introduced. It has higher evaluation precision compared with the
traditional manual assessment method [1–3]. However teaching quality assessment is a
comprehensive process which involves consideration of many factors, such as levels of
assessment, targets and indices of assessment. Besides, in an expert system, the
selection of indices is inevitably influenced by the designers’ knowledge background as
well as their personal preferences. All these undermine the objectivity and precision of
an expert system.
With the development of information technology, new methods are applied in
teaching quality assessment, including multiple linear regression (MLR), analytic
hierarchy process (AHP), partial least squares (PLA), genetic algorithm (GA), support
vector machines (SVM) and artificial neural networks (ANN). One study introduces an
assessment model based on a support vector machine: all the indices are fed into the support vector machine after quantization, and the parameters are tuned by a genetic algorithm. Although the assessment precision is improved, the convergence rate is still slow [4]. In another study, a multi-group genetic algorithm is also employed to improve the parameters of a BP neural network; the precision is improved, but the algorithm is complicated and the training speed of the network is not reported [5]. Research also finds that when a BP network is employed to evaluate a comprehensive assessment system, it often fails to take thorough account of the weights of all the indices [6–9].
Currently, most studies focus on how to build an assessment model based on a BP neural network and how to improve the parameters of the BP network. Although a BP neural network is outstanding at nonlinear mapping, its shortcomings, such as a slow convergence rate and local minima, are not easy to overcome [10]. The learning vector quantization (LVQ) neural network adopts a supervised learning algorithm; sample clustering in the competitive layer is easily obtained by computing the distances between the input sample vectors. The long training time, complicated computation, and other problems of BP neural networks can be overcome by an LVQ neural network to a large degree [11].
In this paper, a teaching quality assessment system is established using analytic
hierarchy process. The indices of the system are taken as the input variables of an LVQ
network to build an AHP-LVQ network model. Its effectiveness is tested by comparing
with that of a traditional BP network model.
Then a judgment matrix is built to conduct pairwise comparisons among all the indices of the same level. All the indices are ranked once their weights are determined, and a consistency test is also conducted [14]. The indices are classified into three levels. The higher level is the overall system of teaching quality assessment indices. The middle level is subordinate to the higher level and consists of factors including teaching content, methods, attitude, and effectiveness. The lower level consists of 12 specific indices that influence the middle level.
$$Y = w_1 f_1 + w_2 f_2 + w_3 f_3 + w_4 f_4 \qquad (1)$$

$$M \cdot w = n \cdot w \qquad (3)$$
If M meets the consistency condition, the weights of the four middle-layer factors f1, f2, f3, and f4 of the target Y can be obtained by solving Eq. (3). The weights of all 12 indices can be obtained in the same way. The finally established teaching quality assessment system is shown in Table 1.
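A compact numpy sketch of the AHP weight derivation described above: the weights are taken from the principal eigenvector of the pairwise judgment matrix and a consistency ratio is checked. The example judgment matrix is invented for illustration and is not taken from the paper.

import numpy as np

RI = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12}     # random consistency index

def ahp_weights(M):
    M = np.asarray(M, dtype=float)
    eigvals, eigvecs = np.linalg.eig(M)
    k = np.argmax(eigvals.real)                       # principal eigenvalue
    w = np.abs(eigvecs[:, k].real)
    w = w / w.sum()                                   # normalized weights
    n = M.shape[0]
    ci = (eigvals[k].real - n) / (n - 1)              # consistency index
    cr = ci / RI[n] if RI[n] > 0 else 0.0             # consistency ratio (< 0.1 acceptable)
    return w, cr

M = [[1, 2, 3, 2], [1/2, 1, 2, 1], [1/3, 1/2, 1, 1/2], [1/2, 1, 2, 1]]
w, cr = ahp_weights(M)
print(np.round(w, 3), round(cr, 3))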
Step 2: Initialize the weight vector Wi(0) as well as the learning rate r(0).
Step 3: Choose the input vector X from the training sample set.
Step 4: The winning neuron c can be obtained according to Eq. (4).
$$\lVert X - W_c \rVert = \min_i \lVert X - W_i \rVert, \quad (i = 1, 2, 3, \ldots, L) \qquad (4)$$
Step 5: Judge whether the classification is correct or not, and adjust the weight vector of the winning neuron according to the following rules: let $R_{W_c}$ be the class associated with the weight vector of the winning neuron and $R_{X_I}$ be the class associated with the input vector. If $R_{X_I} = R_{W_c}$, then Eq. (5) is applied. The weight vectors of the remaining neurons are left unchanged.
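Eqs. (5) and (6) were not recovered from the original; the standard LVQ1 update rules, consistent with the notation above, are (our reconstruction):

$$W_c(n+1) = W_c(n) + r(n)\,\big(X - W_c(n)\big), \quad \text{if } R_{X_I} = R_{W_c} \qquad (5)$$

$$W_c(n+1) = W_c(n) - r(n)\,\big(X - W_c(n)\big), \quad \text{if } R_{X_I} \neq R_{W_c} \qquad (6)$$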
Step 6: Adjust the learning rate r(n), which is calculated by Eq. (7) as:

$$r(n) = r(0)\left(1 - \frac{n}{N}\right) \qquad (7)$$
4 Simulation Experiment
4.1 Collection of Original Sample Data
In this paper, an LVQ network is built on the platform of MATLAB R2013a to conduct
a simulation experiment. Original assessment samples are descriptive statistics of 60
college English teachers’ teaching. There are 12 indices assessed for each sample. The
full score for each index is 10, and the grades of each teacher are given by teaching
supervisors, the fellow teachers and students in the same university. Extreme data in the
grades are deleted to ensure the objectivity and effectiveness of the grading. The
average scores of each index of different teachers are computed and rated as A, B, C and
D. The data are shown in Table 2.
When the neuron number is 16, the network achieves the utmost stability and the
highest convergence speed. Therefore, the structure of the network model is 12-16-4.
The learning function is learnlv1; the target precision is 0.01; the learning rate is 0.1.
[Figure: confusion matrix (output class vs. target class) for the 40 test samples: 1, 16, 12, and 9 samples are correctly assigned to classes 1–4 respectively, and 2 samples of target class 1 are assigned to class 2; overall accuracy 95.0 %.]
[Figure: regression plot of actual output classes against target output classes (Data, Fit, Y = T), with R = 0.86176.]
is achieved. For the AHP-LVQ network to reach the target accuracy, only 581 iterations are needed, whereas the AHP-BPNN needs 3669 iterations. The comparison indicates that the learning efficiency of the AHP-LVQ network is much higher.
[Two figures: training curves of the mean squared error (MSE, log scale): the AHP-LVQ network reaches the target precision in 581 epochs, while the AHP-BPNN needs 3669 epochs.]
5 Conclusion
References
1. Ďurišová, M., Kucharčíková, A., Tokarčíková, E.: Assessment of higher education
teaching outcomes (quality of higher education). J. Procedia-Soc. Behav. Sci. 174, 2497–
2502 (2015)
2. Yan, W.: Application research of data mining technology about teaching quality assessment
in colleges and universities. J. Procedia Eng. 15, 4241–4245 (2011)
3. Ghonji, M., Khoshnodifar, Z., Hosseini, S.M., Mazloumzadeh, S.M.: Analysis of the some
effective teaching quality factors within faculty members of agricultural and natural
resources colleges in Tehran University. J. Saudi Soc. Agric. Sci. 15, 1–7 (2013)
4. Li, B.: Application of university’s teaching quality evaluation based on support vector
machine. J. Comput. Simul. 28, 402–405 (2011)
5. Cai, Z., Chen, X., Shi, W.: Improved learning effect synthetic evaluation method based on
back propagation neural network. J. Chongqing Univ. (Nat. Sci. Ed.) 30, 96–99 (2007)
6. Wei, Z., Yan, K., Su, Y.: Model and simulation of maximum entropy neural network for
teaching quality evaluation. J. Comput. Simul. 30, 284–287 (2013)
7. Zhang, J., Liang, N.: Teaching quality appraisal model based on multivariate
statistical-neural network. J. Nat. Sci. J. Hainan Univ. 28, 188–192 (2010)
8. Zheng, Y., Chen, Y.: Research on evaluation model of university teachers’ teaching quality
based on BP neural network. J. Chongqing Inst. Technol. (Nat. Sci.) 29, 85–90 (2015)
9. Sang, Q.: Study on bilingual teaching evaluation system based on neural network.
J. Jiangnan Univ. (Nat. Sci. Ed.) 9, 274–278 (2010)
10. Ding, S., Chang, X., Wu, Q.: Comparative study on application of LMBP and RBF neural
networks in ECS characteristic curve fitting. J. Jilin Univ. (Inf. Sci. Ed.) 31, 203–209 (2013)
11. Wang, K., Ren, Z., Gu, L., et al.: Research about mine ventilator fault diagnosis based on
LVQ neural network. J. Coal Mine Mach. 32, 256–258 (2011)
12. Saaty, T.L.: Fundamentals of Decision Making and Priority Theory. RWS Publications, Pittsburgh (2001)
13. Feng, Y., Yu, G., Zhou, H.: Teaching quality evaluation model based on neural network and
analytic hierarchy process. J. Comput. Eng. Appl. 49, 235–238 (2013)
14. Ya-ni, Z.: Teaching quality evaluation based on intelligent optimization algorithms.
J. Hangzhou Dianzi Univ. 34, 66–70 (2014)
15. Ding, S., Chang, X., Wu, Q., et al.: Study on wind turbine gearbox fault diagnosis based on
LVQ neural network. J Mod. Electron. Tech. 37, 150–152 (2014)
16. Ding, S., Chang, X., Wu, Q.: A study on the application of learning vector quantization
neural network in pattern classification. J. Appl. Mech. Mater. 525, 657–660 (2014)
17. Zhao, P., Gu, L.: Diagnosis of vibration fault for asynchronous motors based on LVQ neural
networks. J. Mach. Build. Autom. 39, 172–174 (2010)
18. Ding, S., Wu, Q.: Performance comparison of function approximation based on improved
BP neural network. J. Comput. Modernization 11, 10–13 (2012)
Fast Sparse Representation Classification
Using Transfer Learning
1 Introduction
classification. The computation efficiency of SRC constrains its applications. The main
computation time of SRC is consumed in solving the sparse representation coefficients.
This part of time increases greatly as the dimensionality of the sample increases.
Consequently, the SRC is very time consuming or even unfeasible in many face
recognition problems. Therefore, it is necessary to improve the classification efficiency
of SRC. Learning the model from the other domain may be much easier and appro-
priate for classifying the data in the original domain, which is the main idea of transfer
learning [6]. Transfer learning allows the domains used in training and test to be
different, such as transfer via automatic mapping and revision (TAMAR) algorithm [7].
This paper aims to speed up the classification procedure of SRC by using transfer
learning. Suppose that there exist coupled representations, i.e. high-dimensional rep-
resentation (HR) and low-dimensional representation (LR), of the same image, the
training samples in these two representations compose a pair of dictionaries. We
assume the image has the similar sparse representation model on this pair of dic-
tionaries. This assumption allows us to get the approximate solution of the coefficients
in a low-dimensional space at a relatively low computation cost. In our method, we first convert the original (HR) test and training samples to the low-dimensional space by the K-L transform and obtain the LR of the samples. The coefficients are learned by sparsely coding the LR test sample on the low-dimensional dictionary. Then we reconstruct the HR test image class by class from the class-specific HR face images with the obtained sparse coefficients. Finally, we classify the sample according to the reconstruction error. It should be noted that our method is distinctly different from PCA+SRC, i.e., SRC performed in PCA space: while the representation model of our method is learned in PCA space, its representation error is calculated in the original space. If the representation error were calculated directly in the low-dimensional space, some discriminative features might be lost, which may explain why SRC after PCA, random sampling, or sub-sampling does not perform well, as shown in [6].
2 Preliminaries
2.1 SRC
Suppose there are n training samples from t classes. Class i has $n_i$ training samples, and $n = \sum_i n_i$. Each image sample is stretched into a column vector and represented by $x_i \in R^{m \times 1}$, where m is the dimensionality of the sample. The test sample, e.g. $y \in R^{m \times 1}$, is represented as:

$$y = XA \qquad (1)$$
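The sparse coding objective that the following where-clause refers to (presumably Eq. (2), which did not survive extraction) has the standard SRC form:

$$A = \arg\min_{A} \left( \lVert y - XA \rVert_2^2 + \mu \lVert A \rVert_1 \right) \qquad (2)$$

where the regularization symbol $\mu$ stands for the "regular parameter" mentioned below; the exact symbol used in the original may differ.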
where $X = [x_1, x_2, \ldots, x_n]$, $\mu$ is the regularization parameter, $\lVert \cdot \rVert_1$ denotes the $l_1$ norm, and $\lVert \cdot \rVert_2$ denotes the $l_2$ norm. For example, for the vector $v = (v_1, v_2, \ldots, v_k)$, its $l_1$ norm is $\sum_{i=1}^{k} |v_i|$ and its $l_2$ norm is $\sqrt{\sum_{i=1}^{k} v_i^2}$. After obtaining the coefficients, the representation error of each class can be derived by:
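The class-wise residual (presumably Eq. (3), also lost in extraction) takes the standard SRC form, with $X_i$ and $A_i$ denoting the training samples and coefficients associated with class i (our notation):

$$r_i(y) = \lVert y - X_i A_i \rVert_2 \qquad (3)$$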
2.2 PCA
PCA, or the Karhunen-Loeve (KL) transform, is a useful dimensionality reduction technique in signal processing. PCA finds the d directions in which the data has the largest variances and projects the data along these directions. The covariance matrix is defined as:
$$C = \sum_{j=1}^{n} (x_j - \bar{x})^T (x_j - \bar{x}), \quad (j = 1, 2, \ldots, n) \qquad (4)$$
where $\bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j$. The projection vectors of the transform are the d eigenvectors $p_1, p_2, \ldots, p_d$ corresponding to the d largest eigenvalues of the covariance matrix. The PCA transform space is defined as $U = \mathrm{span}\{p_1, p_2, \ldots, p_d\}$.
In many face recognition methods, PCA is usually used for dimensionality reduction before classification, e.g., in the common PCA-plus-SRC framework. Performing PCA before SRC has a lower computational cost than SRC alone and makes the solution sparser, but classifying the data with SRC in PCA space, i.e., PCA+SRC, cannot achieve promising accuracy and can even be much worse than classification in the original space. In the next section, we propose a novel framework that can speed up the classification of SRC without a decrease in accuracy.
3 Fast SRC
SRC needs to solve l1 norm minimization problem. Taking into account the very high
dimensionality of the face image, SRC is very time consuming in face recognition. We
aim to develop an efficient way to learn the representation model of SRC.
Proposition 1. The test image sample y can be coded by $X\alpha$ in the original space, where $\alpha = \arg\min_\alpha (\lVert y - X\alpha \rVert_2^2 + \mu \lVert \alpha \rVert_1)$. Let f denote the transform from the original space to a new space; then the test image in the new space, i.e., $f(y)$, can be coded by $f(X)\alpha'$, where $\alpha' = \arg\min_{\alpha'} (\lVert f(y) - f(X)\alpha' \rVert_2^2 + \mu \lVert \alpha' \rVert_1)$. There exist transforms from the original space to the new space for which $\alpha$ is very close to $\alpha'$.
For example, with subsampling, the models in the original space and the sub-sampled space should be similar. By the $l_1$ norm optimizer, image y can be sparsely represented as $c_1 x_1 + c_2 x_2 + 0x_3 + 0x_4 + 0x_5$ ($c_1 \neq 0$ and $c_2 \neq 0$), and the sub-sampled image y' can be sparsely represented as $c'_1 x'_1 + c'_2 x'_2 + 0x'_3 + 0x'_4 + 0x'_5$ ($c'_1 \neq 0$ and $c'_2 \neq 0$). In the first model, if the coefficient vector $(c_1, c_2, 0, 0, 0)$ is replaced by $(c'_1, c'_2, 0, 0, 0)$, the representation of y becomes $c'_1 x_1 + c'_2 x_2 + 0x_3 + 0x_4 + 0x_5$. In this example, the coefficient vector in the sub-sampled space is an approximate or suboptimal solution of the coefficient vector in the original space.
In many image classification problems, the original test and training samples are in a high-dimensional representation (HR) space. In the framework shown in Fig. 1, we project them onto a subspace to obtain the low-dimensional representation (LR) of the images. The test face image is then reconstructed class by class from the HR face images with the corresponding sparse representation coefficients learned on the LR dictionary. After devising the framework of Fast SRC, we need to find a transform (from the dictionary in HR to the dictionary in LR) with the following two properties: (1) it is computationally efficient to calculate (low cost for obtaining the LR dictionary), and (2) it yields a sparse representation model similar to that in the original space. In the next two subsections, we will demonstrate that the K-L transform (PCA) meets these requirements.
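A compact sketch of the FSRC procedure just described, using scikit-learn's PCA and Lasso as stand-ins for the K-L transform and the l1 solver; the paper does not prescribe a particular optimizer, and the function and variable names are ours:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso

def fsrc_classify(X_hr, labels, y_hr, n_components=100, mu=0.01):
    """X_hr: (n_samples, m) HR training dictionary (rows are samples);
    labels: class label of each training sample; y_hr: (m,) HR test sample."""
    X_hr, labels = np.asarray(X_hr), np.asarray(labels)
    # 1. Learn the LR dictionary and the LR test sample via PCA (K-L transform).
    pca = PCA(n_components=n_components).fit(X_hr)
    X_lr = pca.transform(X_hr)
    y_lr = pca.transform(y_hr.reshape(1, -1))[0]
    # 2. Sparse coding in the low-dimensional space (a cheap l1 problem).
    coder = Lasso(alpha=mu, fit_intercept=False, max_iter=5000)
    coder.fit(X_lr.T, y_lr)            # columns of X_lr.T are the LR atoms
    alpha = coder.coef_
    # 3. Class-by-class reconstruction in the ORIGINAL (HR) space,
    #    then classify by the smallest residual.
    residuals = {}
    for c in np.unique(labels):
        mask = labels == c
        recon = X_hr[mask].T @ alpha[mask]
        residuals[c] = np.linalg.norm(y_hr - recon)
    return min(residuals, key=residuals.get)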
We now show the relationship between the representation models in the original space and in the K-L transform space through the following intuitive explanation. It is reasonable to assume that the prior probability distribution of the training samples coincides with the real probability distribution of the samples. Let $p_1, p_2, \ldots, p_d$ be all the d orthonormal eigenvectors with non-zero eigenvalues obtained by PCA, and denote $P = (p_1, p_2, \ldots, p_d)$. If y is coded by $X\alpha$, where $\alpha = \arg\min(\lVert y - X\alpha \rVert_2^2 + \mu \lVert \alpha \rVert_1)$, we also have $\alpha = \arg\min(\lVert P^T y - P^T X\alpha \rVert_2^2 + \mu_1 \lVert \alpha \rVert_1)$, where $\mu_1$ is a scalar constant. This is because the $l_2$ norm of a vector transformed into the K-L transform space equals that in the original space. Then, if $\lVert y - X\alpha \rVert_2^2 > \lVert y - X\alpha' \rVert_2^2$, we have $\lVert P^T y - P^T X\alpha \rVert_2^2 > \lVert P^T y - P^T X\alpha' \rVert_2^2$. Hence, we can conclude that $\alpha$ is also a coefficient vector of the same test sample y in the K-L transform space.
Fig. 2. Some face images and the reconstructed images by SRC and FSRC, respectively.
model, the time complexity of the PCA procedure amortized over each test sample can be considered as $O(cm^2/t)$, where m and t denote the dimensionality of the sample and the number of test samples, respectively. FSRC calculates the representation error in the original space, and the time complexity of this part is $O(mn)$. The first two columns of Table 1 give the computational complexity comparison between SRC and FSRC. Clearly, FSRC reduces the computational complexity from $O(m^3)$ to $O(c^3 + cm^2/t)$, where c is much less than m.
4 Experiments
ORL, FERET, Extended Yale B and AR datasets were used in the experiments [8–11].
On ORL dataset, we randomly choose 5 images from each class for training, and the
others are used as test samples. To fairly compare the performance, the experiments are
repeated 30 times with 30 randomly possible selections of the training images.
On FERET dataset, we randomly select 4 images from each subject for training. The
experiments are repeated with 10 randomly possible selections. On the Extended
Yale B dataset [10], 5 images of each subject were selected for training. The experi-
ments are also repeated 10 times. The experiments are carried on the above three
datasets, respectively. Two state-of-the-art methods, CRC and SRC, are employed for comparison. Table 1 shows the classification accuracies of SRC, CRC, PCA+SRC, and FSRC. From the results, we find that our method achieves classification accuracy comparable to SRC and CRC. Table 2 shows that the classification efficiency of FSRC is much higher than that of SRC on the first three datasets. In PCA+SRC, the representation error is calculated directly in the low-dimensional space, whereas in the proposed FSRC it is calculated in the original space. To reveal the correlation between where the representation error is calculated and the accuracy, we also ran PCA+SRC on the datasets; compared with the proposed FSRC, PCA+SRC always obtains a lower accuracy. In terms of computation, FSRC achieves higher classification efficiency than SRC. Each image of the AR dataset is corrupted by Gaussian noise with mean 0 and variance 0.01. Figure 3 shows the 26 noised images of the first subject.
Table 1. The classification accuracies (%) of the methods on four face datasets
SRC CRC PCA+SRC FSRC
ORL 92.08 91.72 88.53 92.62
FERET 65.00 58.83 54.53 64.50
Yale B 78.37 78.19 67.40 77.25
AR with noise 93.33 91.27 87.11 92.06
Table 2. The classification time (s) of the methods on four face datasets
SRC CRC PCA+SRC Random+SRC Subsample+SRC FSRC
ORL 1563.4 353.2 683.5 572.9 558.4 734.3
FERET 23752.1 894.3 2135.6 1443.2 1526.7 2556.3
Yale B 45958.0 2375.2 4772.4 3795.2 3419.5 5218.4
AR 21330.5 1372.7 4583.1 2836.8 2583.1 4036.7
Fig. 4. The classification accuracies of the four fast SRC methods on AR dataset.
The classification accuracies and times of SRC, CRC, PCA+SRC, and FSRC on this dataset are shown in the last rows of Tables 1 and 2, respectively. As we know, some feature selection or extraction plus SRC methods, such as PCA+SRC, Random+SRC, and Subsample+SRC [6], also have higher efficiency than SRC. Our method is distinctly different from these methods: as step 4 of our algorithm shows, all the features participate in our classification. PCA+SRC, Random+SRC, and Subsample+SRC are also employed in the experiments for comparison. The classification accuracies of our method, PCA+SRC, Random+SRC, and Subsample+SRC, as their dimensionalities range from 5 to 100 with an interval of 5, are shown in Fig. 4. The Subsample+SRC curve shows two significant drops in classification accuracy at dimensionalities 40 and 80. We believe this phenomenon is caused by a drawback of the subsampling method: when extracting the features, subsampling considers the position of the pixel rather than the discriminant information, so some pixels carrying discriminant information may be lost, and the features generated by subsampling may not be appropriate for classification with SRC. The results show that FSRC outperforms the other three methods at the same dimensionality. Table 2 also shows the running times of these four fast SRC methods on the noised AR dataset. From Fig. 4 and Table 2, we find that the four fast SRC methods cost roughly equal running time, and our method achieves the best classification performance.
5 Conclusion
This paper proposed a fast sparse representation classification (FSRC) algorithm for face recognition. FSRC learns an approximate solution of the sparse representation coefficients in a low-dimensional space at a relatively low computational cost. Based on the idea of transfer learning, FSRC reconstructs the test image in the original space rather than in the low-dimensional space. Therefore, FSRC achieves accuracy comparable to SRC with much higher computational efficiency. It is worth pointing out that the FSRC framework is independent of the optimization algorithm used. We evaluated the proposed method on four face datasets. Compared with SRC, FSRC has significantly lower complexity and very competitive accuracy. Compared with PCA+SRC and the other two SRC-based fast classification frameworks, FSRC achieves the best classification results.
References
1. Wright, J., Yang, A.Y., Ganesh, A., Sastry, S.S., Ma, Y.: Robust face recognition via sparse
representation. IEEE TPAMI 31(2), 1–17 (2009)
2. Wagner, A., Wright, J., Ganesh, A., Zhou, Z., Ma, Y.: Towards a practical face recognition
system: robust registration and illumination via sparse representation. IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2009
3. Zhu, Q., Sun, H., Feng, Q.X., Wang, J.H.: CCEDA: building bridge between subspace
projection learning and sparse representation-based classification. Electron. Lett. 50(25),
1919–1921 (2014)
4. Yang, M., Zhang, L., Zhang, D., Wang, S.: Relaxed collaborative representation for pattern
classification. In: CVPR (2012)
5. Xu, Y., Zhang, D., Yang, J., Yang, J.-Y.: A two-phase test sample sparse representation
method for use with face recognition. IEEE Trans. Circ. Syst. Video Technol. 21(9), 1255–
1262 (2011)
6. Pan, S.J., Yang, Q.: A survey on transfer learning. TKDE 22(10), 1345–1359 (2010)
7. Yang, Q., Pan, S.J., Zheng, V.W.: Estimating location using Wi-Fi. IEEE Intell. Syst. 23(1), 8–13 (2008)
8. http://www.uk.research.att.com/facedataset.html
9. Phillips, P.J., Moon, H., Rizvi, S.A., Rauss, P.: The FERET evaluation methodology for face
recognition algorithms. IEEE TPAMI 22, 1090–1104 (2000)
10. Lee, K., Ho, J., Kriegman, D.: Acquiring linear subspaces for face recognition under variable lighting. IEEE TPAMI 27, 684–698 (2005)
11. Martinez, A., Benavente, R.: The AR face database, The Computer Vision Center, Technical
report (1998)
Probing the Scheduling Algorithms in the Cloud
Based on OpenStack
Yang Luo(B) , Qingni Shen, Cong Li, Kang Chen, and Zhonghai Wu
1 Introduction
Live virtual machine (VM) migration is the process of migrating a VM’s memory
content and processor states to another virtual machine monitor (VMM) even
as the VM continues to execute without halting itself [1]. It has become a key
selling point for recent clouds as it provides assistance for cloud service providers
(CSPs) in load balancing [2], consolidated management, fault tolerance [3,4] and
so on. However, it also introduces novel security threats, in aspects such as the control plane, the data plane, and the migration module [5], which refer respectively to the security of management commands, of virtual machine data, and of the migration mechanism itself. [5] empirically demonstrated the importance of migration security by implementing Xensploit to perform data plane manipulation on Xen and VMware. Since then, data plane issues have been studied thoroughly in works such as [6–8]. In contrast to the data plane's popularity, research on control plane problems is very limited. [5] briefly investigated possible loopholes at the control plane, including incoming migration control, outgoing migration control, and false resource advertising, but practical details were not offered. [9] extended trusted computing to virtualized
systems using vTPM, and [10–12] assumed that the destination is trustworthy and proposed secure vTPM-based transfer protocols for migration between identical platforms; given their complexity, these protocols have yet to see practical use. HyperSafe [13] is a lightweight approach that endows existing Type-I bare-metal hypervisors with a unique self-protection capability to provide lifetime control-flow integrity; however, this technique only guarantees the hypervisor's integrity and does not consider network protection.
Given the increasing popularity of live VM migration, a comprehensive under-
standing of its security is essential. However, the security of its control plane
has yet to be analyzed. This paper first explores threats against live virtual
machine migration in the context of three control plane threats: load balancing,
scheduling and transmission. We then present a scheduling algorithm reverse
approach (SARA) against the functionality of VM scheduling. Effectiveness of
SARA depends on the untrusted network assumption: the attacker is able to
intercept the transmission channel of scheduling control commands by compro-
mising the network or accessing as an insider. We provide a proof of concept for
SARA by automating the exploiting recent OpenStack scheduling mechanism
[14] during live migration, demonstrating that scheduling attack has become a
realistic security issue. This PoC can be used as a self-test tool for CSPs to
measure their defense strength against control plane attacks.
The remainder of the paper is organized as follows: Sect. 2 presents recent
control plane threat classes. Section 3 details the proposed SARA technique for
reversing scheduling algorithm. Section 4 presents the implementation and eval-
uation of the proposed technique. Lastly, conclusions and future work are pre-
sented in Sect. 5.
3 Proposed Technique
This section details the innovative SARA technique to reverse the scheduling
algorithm used during live migration in the cloud. As scheduling is performed
in two stages, the proposed SARA technique is also comprised of two parts:
filter determination and weigher determination. It takes the VM migration requests, the metadata of the hosts, and the destination hosts (the scheduling results) as inputs; the output is the decision result for the filters and weighers.
where r is the VM scheduling request, f is the filter, h is the host, and filter() is the binary filter function, which returns true if and only if the host h passes through the filter f under the request r. Given specific r, f, and h, the return value of filter() can be computed. Note that SARA relies on multiple rounds of scheduling, each of which is called an iteration. Assuming we have n iterations, there are also n migration requests and n destination hosts.
Let $F_{all}$ and F be the set of all filters and the set of activated filters, respectively. Most filters are "constant", except for those containing specific numbers: for example, the IoOpsFilter [14] requires the number of simultaneous IO operations on the destination host to be no more than 8. There is much uncertainty about such a filter, as another CSP might change the IO operation limit to another value such as 16 on its own behalf. We call these filters with exact figures Type-II filters (denoted $F_{v,all}$) and the others Type-I filters (denoted $F_{c,all}$).
The activated filters in Fc,all and Fv,all are Fc and Fv respectively. Obviously
we have Fall = Fc,all ∪ Fv,all and F = Fc ∪ Fv .
Type-I filters. In the following, we discuss how to obtain the set of all activated Type-I filters at each iteration i, denoted as Fi.
where, ri is the i-th migration request and hdst,i is the i-th destination host. In particular, the initial value is F0 = Fc,all. Fi is calculated based on Fi−1 because hdst,i of every iteration must pass through all activated filters. The final filter set is the intersection of the Fi over all iterations; for performance, we can directly compute Fi from Fi−1 and spare the explicit intersection operations. By iteratively applying (2) n times, Fn can be obtained. It is worth noting that although the filters excluded from Fn cannot fulfill our scheduling requirements, this does not mean that the filters within Fn are all necessary. In fact, any subset of Fn lets all destination hosts pass through the filtering, so strictly Fc ⊆ Fn holds. In this work we suppose Fc ≈ Fn, as this occurs with the greatest probability; when the number of requests grows very large, the error between Fc and Fn is small enough to be omitted.
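As an illustration of this filter-determination step, the following is a minimal sketch that intersects the candidate set with the filters that each observed destination host passes; the predicates and host fields are simplified stand-ins, not the actual OpenStack implementations.

def determine_type1_filters(filters, iterations):
    # filters: dict mapping a filter name to a predicate filter(request, host) -> bool
    # iterations: list of observed (request, destination_host) pairs
    candidates = set(filters)                      # F_0 = F_c,all
    for request, dst_host in iterations:
        # F_i keeps only the filters that the observed destination passes
        candidates = {name for name in candidates
                      if filters[name](request, dst_host)}
    return candidates                              # F_n, assumed to approximate F_c

# Illustrative predicates and host metadata (assumptions for the sketch):
filters = {
    "RamFilter":     lambda r, h: h["free_ram_mb"] >= r["ram_mb"],
    "ComputeFilter": lambda r, h: h["service_up"],
}
observations = [({"ram_mb": 512}, {"free_ram_mb": 2048, "service_up": True})]
print(determine_type1_filters(filters, observations))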
Type-II filters. Fv is defined as the set of filters with one exact figure, which appears either in the request parameter (Type II-1) or in the host parameter position (Type II-2). IoOpsFilter belongs to the former class; a case of the latter is not yet available in OpenStack [14], although this may change in the future. The difficulty of calculating Fv lies in the fact that we are unaware of the exact parameter value used by a Type-II filter; it needs to be estimated in some manner. For simplicity, we only explore the Type II-1 case, as Type II-2 filters can be resolved similarly. The filter function for Type II-1 filters is given in (3).
where, f(x) is the Type-II filter with the exact value x in the place of the original request parameter, so we do not need a request variable here. hdst,i(yi) is the destination host of iteration i with host parameter yi. As x and yi are both figures, a common case is that the scheduler compares the values of x and yi and decides whether the host can pass through this filter. The value of x must satisfy filter(f(x), hdst,i(yi)) = true for each iteration i. Furthermore, we can reasonably suppose that f(x) actually filtered some hosts out, as CSPs have no desire to deploy a filter that has no effect. Therefore, we have:
where, hj(yj) is any non-destination host in each iteration and Fx is the set of all activated Type-II filters. (4) indicates that f(x) belongs to Fx if and only if f(x) passes each hdst,i(yi) and refuses at least one hj(yj). Assume the filtering pattern of f(x) is x ≥ yi (the case x ≤ yi is analogous); obviously we have
where, w is the weigher, h is the host, and weigh() is the weighing function,
returning the weight score that the weigher w gives to the host h. The higher the score a host obtains, the more likely it is to be accepted as the migration destination.
  weight(h) = \sum_{i=1}^{p} m_i × weigh(w_i, h)    (7)
where, p is the number of weighers, w_i is the i-th weigher, m_i is the multiplier for w_i, and weight() is the weighted summation function that sums up the weight scores from all weighers and returns the final weight of a host. After the weighing step, the hosts are sorted by final weight in descending order. Besides, we assume that scheduling randomization is disabled in our scenario, as it is in OpenStack by default, so the migration destination will always be the top host in the list.
We have developed a mathematical method to figure out the multipliers. First we group iterations with identical filtered host lists together. By observing the iterations in ascending order of request timestamp, we discovered a pattern: the migration destination of the previous request tends to be selected again for the current request, and this does not change until the favored host has been consumed too much to maintain the first position in the host list (consumption of the host degrades its weight). We call the alternation between two destination hosts within a group a host shifting. Take the host shifting from hA to hB for instance: before the shift we have weight(hA) > weight(hB), and after the shift we get weight(hA) < weight(hB). Combining the two inequalities, we can roughly draw a conclusion:
As a cloud usually owns hosts on a massive scale, the error between the two sides of (8) can reasonably be ignored, so it is regarded as an equation here. Substituting (7) into (8), we obtain the host shifting equation:
  \sum_{i=1}^{p} m_i × weigh(w_i, h_A) = \sum_{i=1}^{p} m_i × weigh(w_i, h_B)    (9)
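Each observed host shifting therefore contributes one linear equation in the unknown multipliers. Adding a normalization row in which the multipliers sum to 1 (as in the first row of (10) below) turns the accumulated equations into an ordinary least-squares problem; a minimal sketch, where each shift row holds the weigh-score differences between hA and hB for the three weighers:

import numpy as np

def solve_multipliers(shift_rows):
    # shift_rows: one vector per observed host shifting; entry i is the
    # difference of the weigh scores of h_A and h_B for weigher w_i,
    # so each row corresponds to one homogeneous equation of the form (9).
    A = np.vstack([np.ones(len(shift_rows[0]))] + list(shift_rows))
    beta = np.zeros(A.shape[0])
    beta[0] = 1.0                                  # normalization: multipliers sum to 1
    m, *_ = np.linalg.lstsq(A, beta, rcond=None)
    return m

# Two rows borrowed from (10) below, for illustration only:
rows = [np.array([-0.6046, 0.0859, 0.6754]),
        np.array([0.0348, 0.0176, 0.0054])]
print(solve_multipliers(rows))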
By observing the data in Table 1, we found that: (i) precision is positively correlated with the dataset scale, so larger datasets achieve better results, and precision can reach 100 % given sufficient requests; (ii) most deactivated filters are found by the first several requests, as the precision for the first 10 requests already reaches 86.05 %; (iii) SARA works by eliminating impossible filters, so the more filters a CSP has activated in the cloud, the fewer remaining filters need to be determined, which makes it easier for SARA to achieve good precision.
  A, β =
    [  1.0000   1.0000   1.0000 | 1.0 ]
    [ −0.6046   0.0859   0.6754 | 0.0 ]
    [  0.0348   0.0176   0.0054 | 0.0 ]
    [  1.8193  −1.0514   0.0084 | 0.0 ]
    [  2.0000  −1.9663   1.9960 | 0.0 ]
    [  0.6985   0.1145  −1.3094 | 0.0 ]
    [  2.0000  −1.0000  −2.0000 | 0.0 ]
    [ −1.0000  −1.0000  −1.0000 | 0.0 ]    (10)
  (the left 8 × 3 block is A and the last column is β)
Dataset m1 m2 m3 Error
(100, 20) N/A N/A N/A N/A
(200, 20) 0.6610 1.1030 -0.7640 1.1930
(400, 20) 0.3868 0.5246 0.0886 0.1434
(600, 40) 0.2924 0.4370 0.2706 0.0949
(800, 40) 0.2947 0.4573 0.2480 0.0645
(100, 20)’ N/A N/A N/A N/A
(200, 20)’ 0.3839 0.6161 0.0000 0.2460
(400, 20)’ 0.3070 0.5103 0.1827 0.0213
(600, 40)’ 0.2926 0.5076 0.1998 0.0106
(800, 40)’ 0.2912 0.4949 0.2139 0.0172
*expected 0.3000 0.5000 0.2000 0.0000
request datasets show that SARA can reliably determine the filters and weighers deployed by CSPs in OpenStack, indicating how vulnerable widespread cloud software is to network sniffing attacks. To secure virtual machine migration, encryption should be enforced not only on the data plane but also on the control plane, such as the channel transmitting the scheduling commands. In our ongoing work the SARA technique will be improved by adding anti-randomization support, and we are also exploring the security issues of load balancing, to investigate whether its algorithm can likewise be reversed through traffic analysis.
Acknowledgments. We thank the reviewers for their help improving this paper.
This work is supported by the National High Technology Research and Development
Program (“863” Program) of China under Grant No. 2015AA016009, the National
Natural Science Foundation of China under Grant No. 61232005, and the Science and
Technology Program of Shenzhen, China under Grant No. JSGG20140516162852628.
References
1. Clark, C., Fraser, K., Hand, S., Hansen, J.G., Jul, E., Limpach, C., Warfield,
A.: Live migration of virtual machines. In: Proceedings of the 2nd Conference on
Symposium on Networked Systems Design & Implementation, vol. 2, pp. 273–286.
USENIX Association (2005)
2. Forsman, M., Glad, A., Lundberg, L., Ilie, D.: Algorithms for automated live migra-
tion of virtual machines. J. Syst. Softw. 101, 110–126 (2015)
3. Meneses, E., Ni, X., Zheng, G., Mendes, C.L., Kale, L.V.: Using migratable objects
to enhance fault tolerance schemes in supercomputers. IEEE Trans. Parallel Dis-
trib. Syst. 26(7), 2061–2074 (2014)
4. Yang, C.T., Liu, J.C., Hsu, C.H., Chou, W.L.: On improvement of cloud virtual
machine availability with virtualization fault tolerance mechanism. J. Supercom-
puting 69(3), 1103–1122 (2014)
5. Oberheide, J., Cooke, E., Jahanian, F.: Empirical exploitation of live virtual
machine migration. In: Proceedings of BlackHat DC Convention 2008
6. Ver, M.: Dynamic load balancing based on live migration of virtual machines:
security threats and effects. Rochester Institute of Technology (2011)
7. Perez-Botero, D.: A Brief Tutorial on Live Virtual Machine Migration From a
Security Perspective. Princeton University, USA (2011)
8. Duncan, A., Creese, S., Goldsmith, M., Quinton, J.S.: Cloud computing: insider
attacks on virtual machines during migration. In: 2013 12th IEEE International
Conference on Trust, Security and Privacy in Computing and Communications
(TrustCom), pp. 493–500. IEEE (2013)
9. Perez, R., Sailer, R., van Doorn, L.: vTPM: virtualizing the trusted platform mod-
ule. In: Proceedings of the 15th Conference on USENIX Security Symposium, pp.
305–320 (2006)
10. Zhang, F., Huang, Y., Wang, H., Chen, H., Zang, B.: PALM: security preserving
VM live migration for systems with VMM-enforced protection. In: Third Asia-
Pacific Trusted Infrastructure Technologies Conference, APTC 2008, pp. 9–18.
IEEE (2008)
11. Masti, R.J.: On the security of virtual machine migration and related topics. Master
Thesis, Eidgenossische Technische Hochschule Zurich (2010)
12. Aslam, M., Gehrmann, C., Bjorkman, M.: Security and trust preserving VM migra-
tions in public clouds. In: 2012 IEEE 11th International Conference on Trust,
Security and Privacy in Computing and Communications (TrustCom), pp. 869–
876. IEEE (2012)
13. Wang, Z., Jiang, X.: Hypersafe: a lightweight approach to provide lifetime hyper-
visor control-flow integrity. In: 2010 IEEE Symposium on Security and Privacy
(SP), pp. 380–395. IEEE (2010)
14. Scheduling - OpenStack Configuration Reference - juno. http://docs.openstack.
org/juno/config-reference/content/section compute-scheduler.html
15. Hines, M.R., Deshpande, U., Gopalan, K.: Post-copy live migration of virtual
machines. ACM SIGOPS Oper. Syst. Rev. 43(3), 14–26 (2009)
16. Zhang, Y., Juels, A., Reiter, M.K., Ristenpart, T.: Cross-VM side channels and
their use to extract private keys. In: Proceedings of the 2012 ACM Conference on
Computer and Communications Security, pp. 305–316. ACM (2012)
17. Vinoski, S.: Advanced message queuing protocol. IEEE Internet Comput. 6, 87–89
(2006)
18. Baxter, J.H.: Wireshark Essentials. Packt Publishing Ltd, UK (2014)
19. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J.,
Sorensen, D.: LAPACK Users’ Guide, vol. 9. SIAM, Philadelphia (1999)
20. Sanderson, C.: Armadillo: an open source C++ linear algebra library for fast pro-
totyping and computationally intensive experiments (2010)
Top-k Distance-Based Outlier Detection
on Uncertain Data
Abstract. In recent years, with the development of the Internet of Things, more and more researchers are studying uncertain data. Outlier detection is one of the significant branches of the emerging field of uncertain databases. In existing algorithms, parameters are difficult to set and scalability is poor on large data sets. To address these shortcomings, a top-k distance-based outlier detection algorithm on uncertain data is proposed. The algorithm applies dynamic programming to calculate the outlier probability, which greatly improves efficiency. Furthermore, an efficient virtual grid-based optimization approach is proposed to improve the algorithm's efficiency even further. Theoretical analysis and experimental results show that the algorithm is feasible and efficient.
1 Introduction
In the real world, data contains uncertainty for reasons including poor equipment precision, missing data, and transmission delay. This new data pattern has become a research hotspot, and outlier detection on it is one of the open problems. Outlier detection is a fundamental issue in data mining. It has been applied in many areas including network intrusion detection [1], industrial sciences [2], environmental monitoring [3], credit card fraud detection [4], etc. An outlier is a data object whose behavior or characteristics obviously deviate from the other sample data. The main task of outlier detection is to find outliers in large and complex datasets [5].
Outlier detection originated in statistics and was later introduced into the data mining domain by Knorr [6, 7]. Several outlier detection algorithms are commonly used at present, such as distance-based and density-based outlier detection. However, these algorithms mainly focus on deterministic data and cannot directly process uncertain data; most existing outlier detection algorithms on uncertain data are improvements and innovations built on them.
At present, the outlier detection technology for deterministic data is mature and widely used, but research on uncertain data is just beginning. The first definition of an outlier on uncertain data was given by Aggarwal and Yu [8].
The main difference between certain and uncertain data is that uncertain data contain uncertain records together with their existential probabilities. Therefore, every combination of tuples in the ud-neighborhood of a data object needs to be unfolded, and a possible world model is built. There are many models of uncertain data, but the possible world model is the most popular one [14].
Definition 2 (Probability of the possible world instance (PPW)). For uncertain data xi, an arbitrary combination of tuples in R(xi) constitutes a possible world instance (PW), and all possible world instances constitute the possible world of xi. We denote the probability of a possible world instance by PPW, which can be computed as follows:
  PPW = \prod_{x_j ∈ R(x_i) ∧ x_j ∈ PW} p_j × \prod_{x_k ∈ R(x_i) ∧ x_k ∉ PW} (1 − p_k)    (1)
Where xj and xk are objects in R(xi), xj is in the possible world instance PW and xk is not in PW, and pj and pk are the existential probabilities of xj and xk, respectively.
For deterministic data, a distance-based outlier is a data object that does not have enough neighbors within a given distance threshold [15]. For uncertain data, however, the existential probability needs to be taken into account in the detection. We can only judge whether an object is an outlier by calculating the probability of the object being an outlier.
Definition 3 (Outlier probability). For an uncertain data object xi, we define the sum of the probabilities of the possible world instances that contain fewer than n tuples as its outlier probability. Then,
  P_outlier(x_i) = \sum_{PW_j ∈ S_n(x_i, PW)} PPW_j    (2)
Where Poutlier(xi) denotes the outlier probability of xi, and Sn(xi, PW) is the set of possible world instances that contain fewer than n tuples in the possible world of xi.
Definition 4 (Top-k Outlier). We sort all uncertain data in descending order of their
outlier probability, and collect the preceding k objects as top-k outliers on uncertain
data.
If R(xi) is the ud-neighborhood of uncertain data xi, let R(xi) = {x0, x1, …, xm−1}. [R(xi), j] denotes the event that exactly j tuples in R(xi) occur. The events in [R(xi), j] are exactly the possible world instances that contain j tuples of R(xi). Based on this, we can use an equivalent conversion to compute the outlier probability of an object. Listing all the possible events, we have:

  [R(x_i) < n] = [R(x_i), 0] ∪ [R(x_i), 1] ∪ · · · ∪ [R(x_i), n−1]    (3)

Where [R(xi) < n] denotes the event that fewer than n tuples in R(xi) occur, and [R(xi), j] (j = 0, 1, …, n−1) denotes the event that exactly j tuples in R(xi) occur.
Then, the outlier probability of xi can be converted to another expression:
  P_outlier(x_i) = P[R(x_i) < n] = P[R(x_i), 0] + P[R(x_i), 1] + · · · + P[R(x_i), n−1] = \sum_{j=0}^{n−1} P[R(x_i), j]    (4)
Where P[R(xi) < n] denotes the probability of event [R(xi) < n]. P[R(xi), j] (j = 0, 1,
… ,n-1) denotes the probability of event [R(xi), j].
The problem becomes how to calculate P[R(xi), j] efficiently. Following the dynamic programming theory, R(xi) is divided into two parts: the last tuple and the rest of the tuples. If the last tuple occurs, the next step is to calculate the probability of the event that exactly j−1 tuples among the rest occur; if the last tuple does not occur, the next step is to calculate the probability of the event that exactly j tuples among the rest occur.
The order of tuples in R(xi) remains unchanged during the calculation. Let |R(xi)| = m, and denote the probabilities of the tuples in R(xi) by p0, p1, …, pm−1, respectively. In this paper, we use a two-dimensional array to store the values of P[R(xi), j]. We create a two-dimensional array T with m rows and n columns, where T[i][j] denotes the probability of the event that exactly j tuples occur in the dataset consisting of the first i tuples of R(xi). Then P[R(xi) < n] is the sum of the values in the last row of T:
  P_outlier(x_i) = P[R(x_i) < n] = \sum_{j=0}^{n−1} T[m][j]    (5)
The row index of the array starts at 1 because the formula is meaningless when i = 0. The recurrences for the two-dimensional array are as follows:
  T[i][j] =
    1 − p_0,                                                if j = 0, i = 1
    p_0,                                                    if j = 1, i = 1
    (1 − p_{i−1}) · T[i−1][0],                              if j = 0, i > 1
    p_{i−1} · T[i−1][j−1],                                  if j = i, i > 1
    p_{i−1} · T[i−1][j−1] + (1 − p_{i−1}) · T[i−1][j],      if j ≠ 0, j < i
    0,                                                      if j > i        (6)
where Ri′ (i = 1, …, m1) is the data set consisting of the first i tuples of R1, and P[R1 < n] = T[m1][0] + … + T[m1][n−1].
In this paper, the above dynamic programming algorithm is referred to as DPA, and the whole procedure is called the basic algorithm of top-k outlier detection (BA).
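For concreteness, the following is a minimal sketch of the DPA calculation of Eqs. (5)–(6); the function and variable names are ours, and probs holds the existential probabilities of the tuples in R(xi).

def outlier_probability(probs, n):
    m = len(probs)
    if m == 0:
        return 1.0                       # no neighbors can occur at all
    # T[i][j]: probability that exactly j of the first i tuples occur.
    T = [[0.0] * n for _ in range(m + 1)]
    T[1][0] = 1.0 - probs[0]
    if n > 1:
        T[1][1] = probs[0]
    for i in range(2, m + 1):
        p = probs[i - 1]
        T[i][0] = (1.0 - p) * T[i - 1][0]
        for j in range(1, min(i, n - 1) + 1):
            T[i][j] = p * T[i - 1][j - 1] + (1.0 - p) * T[i - 1][j]
    return sum(T[m][:n])                 # Eq. (5): P[R(x_i) < n]

print(outlier_probability([0.9, 0.8, 0.7], n=2))   # probability that fewer than 2 neighbors occur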
In the experiments, we find that BA needs to search the ud-neighborhood and calculate the outlier probability of every object, which inevitably brings high time complexity. A virtual grid-based optimization algorithm is therefore proposed to reduce the cost of BA.
Where X is the coordinate of the cell; checked denotes whether the cell has been checked; info denotes the information about the tuples in the cell, such as the tuple set, the number of tuples, and the sum of the probabilities of the tuples; down points to the next cell in the same column; and right points to the next cell in the same row.
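A minimal sketch of such a cell node, with illustrative field names; the cross-list links down and right are plain object references here.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CellNode:
    X: tuple                                      # coordinate of the cell
    checked: bool = False                         # whether the cell has been checked
    tuples: List = field(default_factory=list)    # tuple set stored in the cell
    prob_sum: float = 0.0                         # sum of the existential probabilities
    down: Optional["CellNode"] = None             # next cell in the same column
    right: Optional["CellNode"] = None            # next cell in the same row

    def add(self, obj, prob):
        # Add a tuple and update the cell information.
        self.tuples.append(obj)
        self.prob_sum += prob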
For a given uncertain dataset D, we divide its domain into many mutually disjoint square cells and calculate the number of cells in each dimension. According to the number of cells in each dimension, we establish a total head node and two lists of head nodes. Reading tuples xi sequentially from the uncertain dataset D, we add xi into its cell and update the cell information if the cell containing xi is already in VG; otherwise, we create a new cell node, insert it into the cross list, and then add xi into the cell and update the cell information. VG is not finished until all tuples have been read. Its structure is shown in Fig. 4.
Each side of a cell has length ud/2. Let Cx,y be a cell in 2-D space. L(Cx,y) = {Cu,v | |u − x| ≤ 2, |v − y| ≤ 2, Cu,v ≠ Cx,y} denotes the neighbor cell set of Cx,y. Cells have the following property: for any xi ∈ Cx,y, R(xi) ⊆ (Cx,y ∪ L(Cx,y)) and Cx,y ⊆ R(xi). So we only need to search L(Cx,y) for the ud-neighborhood of xi when searching R(xi).
(2) Clustering
Once VG is constructed, we traverse VG and obtain a set of adjacent cells in the first column (such as a and b in Fig. 4), store them into a temporary list and into the cluster set, then find adjacent cells in the same rows (such as b and c in Fig. 4) and store those adjacent cells not yet in the temporary list into the cluster set, and finally clear the temporary list. These steps are repeated until no cell is left in VG.
LC = {C1, …, Ci, …, CM} denotes the set of clusters after clustering. Let |Ci| be the number of tuples in Ci and Counti be the number of cells in Ci. According to the properties of the DPA, the outlier probability of an uncertain data object is influenced by the probability distribution and the tuple density. The probability that Ci contains outliers is larger when its sum of existential probabilities is smaller and its number of cells is larger. Considering these factors, we measure the probability that Ci contains outliers by the average probability of the cells in Ci. Let den be the probability threshold. Then,
  den = \frac{\sum_{j=1}^{|C_i|} p_j}{Count_i}    (7)
Where pj is the existential probability of the object in Ci. The smaller den is, the
greater the probability of containing outliers in Ci becomes.
During the algorithm, δ gradually increases, and the greater δ is, the more objects can be pruned. The optimization algorithm therefore gives priority to the clusters whose den is minimum; the cells in a cluster are then sorted by their sum of probabilities in ascending order, and outlier detection starts from the cell whose sum of probabilities is minimum, which makes δ increase rapidly.
(2) The Second Stage. In the second stage, we detect outliers in each cluster based on a pruning rule.
During the neighbor search or the probability calculation for a data object, we can immediately interrupt the search or calculation if we can already judge that the object is a non-outlier. The following pruning rule is therefore presented. M is the dataset that stores the neighbors of the query object found so far during the algorithm.
Pruning rule: If P[M < n] ≤ δ and M contains only part of the neighborhood of a query object, this query object can be pruned as a non-outlier. A special case: if P[M < n] ≤ δ and M contains exactly all objects of a cell, all objects in this cell can be pruned as non-outliers.
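A minimal sketch of this pruning test, reusing the outlier_probability routine from the DPA sketch above; delta is the smallest outlier probability currently kept in the top-k minimum heap.

def can_prune(partial_neighbor_probs, n, delta):
    # partial_neighbor_probs: existential probabilities of the neighbors
    # collected in M so far (possibly only part of the ud-neighborhood).
    # P[M < n] can only decrease as more neighbors are added to M, so once it
    # drops to delta or below, the query object cannot enter the top-k heap.
    return outlier_probability(partial_neighbor_probs, n) <= delta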
The process of judging whether uncertain data xi is an outlier is as follows. First, if the number of objects in the minimum heap is less than k, we calculate the outlier probability of xi and insert xi into the minimum heap. Otherwise, all tuples in the cell containing xi are stored into M and P[M < n] is calculated; if P[M < n] ≤ δ, all tuples in this cell can be pruned as non-outliers. Otherwise, we find a neighbor cell in the cluster containing xi, search it for neighbors, store them into M, and recalculate P[M < n] (by DIC) whenever a cell is finished. If P[M < n] ≤ δ, xi is not an outlier. If the pruning condition is still not met after all neighbor cells in the cluster containing xi have been examined and there remain undetected neighbor cells, we find the remaining neighbor cells in VG and repeat the above calculation and judgment. If, after all neighbor cells have been evaluated, P[M < n] is still greater than δ, we remove the top object of the minimum heap, insert xi into the heap, assign the outlier probability of the new top object of the heap to δ, and continue to test the next object.
The whole algorithm flowchart is shown in Fig. 5.
During the algorithm, δ gradually increases, so the vast majority of objects only need to search a small part of their ud-neighborhood before they can be judged, which saves a lot of time.
The neighbor cells of the vast majority of objects are clustered in the same cluster, and only a few objects need to search VG for their ud-neighborhood in the first stage of the optimization algorithm.
In this paper, the above algorithm is called the top-k virtual grid-based outlier detection algorithm on uncertain data (VGUOD for short).
4 Experiments
To evaluate the performance of the proposed algorithm objectively, experiments are conducted in this section on two synthetic datasets, SynDS1 and SynDS2, focusing on the influence of several parameters and on comparisons of precision and running time between VGUOD and the GPA algorithm proposed in [12]. Software environment: 2.93 GHz CPU, 2 GB main memory, and Microsoft Windows 7 Ultimate. Experiment platform: Microsoft Visual Studio 2010. Language: C++.
The synthetic datasets are generated by Matlab. Each dataset has 100,000 uncertain records. The valued attributes are 2-dimensional, and each dimension is a floating point number distributed in [0, 1000]; SynDS1 and SynDS2 are composed of several normal distributions and uniform distributions, respectively. The existential probabilities were randomly generated in the range (0, 1).
The number of outliers increases as k grows, and the running time of both approaches increases accordingly. However, as Fig. 6 shows, the running time of VGUOD is far less than that of BA, because the virtual grid structure filters out all empty cells and finds neighbors more easily; besides, the pruning method based on the virtual grid structure can trim most non-outliers, so VGUOD effectively saves running time.
The relationship between ud and the running time is illustrated in Fig. 7. The number of ud-neighbors of an object increases as ud grows from 10 to 28. BA needs to search the whole dataset and calculate the outlier probability for each object, which inevitably costs more computation time. For VGUOD, however, the value of P[M < n] within a cell declines as ud grows, and the smaller P[M < n] is, the greater the probability of meeting the pruning condition becomes. So VGUOD needs less running time than BA.
Finally, the effect of the data size on the running time is discussed. We used ten datasets generated by Matlab, whose numbers of records vary from 20,000 to 160,000, to test BA and VGUOD. Figure 9 shows the running time of BA and VGUOD on the different datasets. The more records a dataset contains, the greater the amount of calculation and the running time consumption. BA needs to calculate all objects, so its computational effort remarkably increases with the size of the dataset. The running time of VGUOD is far less than that of BA because of its pruning ability.
Where B denotes the total number of outliers found by the algorithm, and b is the number of true outliers found by the algorithm.
Figures 10 and 11 respectively contrast the precision and the execution time of GPA and VGUOD when they run on the same dataset to detect the same number of outliers, with ud = 20 and n = 4.
By analyzing the experimental results, we can observe that VGUOD outperforms GPA in both running time and precision. The VGUOD algorithm returns top-k outliers, which guarantees a higher precision than GPA when detecting the same number of outliers. In terms of running time, the VGUOD algorithm filters empty cells and judges whether an object is an outlier with far less computation, which greatly saves running time.
5 Conclusions
As a new but important research field, outlier detection on uncertain data has extensive application prospects. In this paper, a new definition of outlier detection on uncertain data is put forward; we then introduce the dynamic programming idea to efficiently calculate the outlier probability of the data and propose an efficient virtual grid-based optimization method. The algorithm is, to a certain extent, suited to detecting outliers in large datasets.
In future work we will study more complex uncertain data models and outlier detection on uncertain data in high-dimensional data spaces.
References
1. Zhang, J., Zulkernine, M.: Anomaly based network intrusion detection with unsupervised
outlier detection. In: IEEE International Conference on Communications, ICC 2006,
pp. 2388–2393. IEEE (2006)
2. Alaydie, N., Fotouhi, F., Reddy, C.K., Soltanian-Zadeh, H.: Noise and outlier filtering in
heterogeneous medical data sources. In: 2012 23rd International Workshop on Database and
Expert Systems Applications, pp. 115–119. IEEE (2010)
3. Knorr, E.M., Ng, R.T.: Algorithms for mining distance-based outliers in large datasets. In:
Proceedings of the 24th International Conference on Very Large Data Bases, pp. 392–403.
Morgan Kaufmann Publishers Inc. (1998)
4. Wang, L., Zou, L.: Research on algorithms for mining distance-based outliers. J. Electron.
14, 485–490 (2005)
5. Han, J., Kamber, M.: Data Mining: Concepts and Techniques, 2nd edn. Morgan Kaufmann, San Francisco (2006)
6. Knorr, E.M., Ng, R.T.: Finding intensional knowledge of distance-based outliers. In: VLDB,
pp. 211–222 (1999)
7. Knorr, E.M., Ng, R.T., Tucakov, V.: Distance-based outliers: algorithms and applications.
VLDB J. — Int. J. Very Large Data Bases 8(3–4), 237–253 (2000)
8. Aggarwal, C.C., Yu, P.S.: Outlier detection with uncertain data. In: SDM (2008)
9. Shaikh, S.A., Kitagawa, H.: Distance-based outlier detection on uncertain data of Gaussian
distribution. In: Sheng, Q.Z., Wang, G., Jensen, C.S., Xu, G. (eds.) APWeb 2012. LNCS,
vol. 7235, pp. 109–121. Springer, Heidelberg (2012)
10. Shaikh, S.A., Kitagawa, H.: Fast top-k distance-based outlier detection on uncertain data. In:
Wang, J., Xiong, H., Ishikawa, Y., Xu, J., Zhou, J. (eds.) WAIM 2013. LNCS, vol. 7923,
pp. 301–313. Springer, Heidelberg (2013)
11. Shaikh, S.A., Kitagawa, H.: Top-k outlier detection from uncertain data. Int. J. Autom.
Comput. 11(2), 128–142 (2014)
12. Wang, B., Xiao, G., Yu, H., et al.: Distance-based outlier detection on uncertain data. In:
IEEE Ninth International Conference on Computer & Information Technology,
pp. 293–298. IEEE (2009)
13. Wang, B., Yang, X.-C., Wang, G.-R., Ge, Yu.: Outlier detection over sliding windows for
probabilistic data streams. J. Comput. Sci. Technol. 25(3), 389–400 (2010)
14. Abiteboul, S., Kanellakis, P., Grahne, G.: On the representation and querying of sets of
possible worlds. In: PODS 2001, pp. 34–48 (1991)
15. Angiulli, F., Pizzuti, C.: Fast outlier detection in high dimensional spaces. In: Elomaa, T.,
Mannila, H., Toivonen, H. (eds.) PKDD 2002. LNCS (LNAI), vol. 2431, pp. 15–27.
Springer, Heidelberg (2002)
16. Dong, J., Cao, M., Huang, G., Ren, J.: Virtual grid-based clustering of uncertain data on
vulnerability database. J. Convergence Inf. Technol. 7(20), 429–438 (2012)
An Airport Noise Prediction Model Based
on Selective Ensemble of LOF-FSVR
1 Introduction
With the rapid development of China's civil aviation, the airport noise pollution problem is becoming more and more serious and has become one of the obstacles to the sustainable and healthy development of the civil aviation industry. Therefore, effectively predicting airport noise is of vital importance to the design of airports and flights. Currently, airport noise is commonly predicted by noise calculation models and noise prediction software [1], such
as INM [2], Noisemap [3] and SoundPLAN [4]. However, each of these methods has drawbacks. For example, Asensio [5] pointed out that the INM model does not consider the aircraft taxiing factor, leading to prediction deviation; Yang [6] pointed out that the INM model sometimes cannot provide accurate noise predictions for realistic environments. Thus, it is urgent to take advantage of advanced information technology to scientifically predict airport noise and thus provide decision support for the civil aviation sector.
In recent years, owing to the development of wireless sensing technology and the Internet of Things, some effective noise monitoring systems have been developed. In these systems, wireless sensors are located within the influence scope of the airport noise to capture real-time noise data, on which a comprehensive analysis can be made to provide decision support for the related departments to control the airport noise pollution problem.
The large amount of monitored noise data provides a sufficient condition for
applying machine learning algorithms to predict airport noise. Berg’s study [7]
indicated that we can effectively improve the noise prediction accuracy by com-
bining the acoustic theory calculation model with the pattern learned from the
actual noise data. Makarewicz et al. [8] studied the impact pattern of vehicle
speed on noise prediction in a statistical way based on the historical data, and
presented a method to calculate the annual average sound level of the road traf-
fic noise. Yin [9] adopted a multi-layer neural network using the L-M optimization algorithm to predict road traffic noise. Wen et al. [10] analyzed the characteristics of the noise data and presented a time series prediction model based on the GM and LSSVR algorithms. Xu et al. [11] applied a fast hierarchical clustering algorithm to predict the noise distribution of flight events. Chen et al.
[12] applied an integrated BP neural network to build an observational learning
model, which provides interactive predictions for the abnormal noise monitoring
nodes. The above studies indicate that noise prediction models trained on the monitored noise data have better prediction performance than general empirical noise calculation models.
As is well known, the support vector machine (SVM) [13] is an excellent machine learning algorithm and has been demonstrated to have better prediction capacity than many existing algorithms. Thus, in this work, we adopt SVM as the base learning algorithm to train the airport noise prediction model. However, in some cases, the monitored airport noise data contain many outliers, which degrade the performance of the trained SVM model. To improve the outlier immunity of SVM, we design a Local Outlier Factor based Fuzzy Support Vector Regression algorithm (LOF-FSVR), in which the fuzzy membership of each sample is calculated based on a local outlier factor (LOF) to better describe the deviation degree of each sample from the regression interval. The conducted experiments demonstrate the superiority of LOF-FSVR over other closely related algorithms.
In the past decade, ensemble learning [14] has become a powerful paradigm for improving the prediction performance of individual models: it first trains multiple diverse models and then combines their outputs using an appropriate combination method, e.g., the simple majority voting rule, to obtain the final prediction result.
  min  \frac{1}{2}||w||^2 + C \sum_{i=1}^{l} (ζ_i^− + ζ_i^+)
  s.t. y_i − (w · φ(x_i)) − b ≤ ε + ζ_i^−
       (w · φ(x_i)) + b − y_i ≤ ε + ζ_i^+                    (1)
       ζ_i^−, ζ_i^+ ≥ 0,  i = 1, 2, · · · , l
In the above formula, ε is the parameter of the insensitive loss function, ζi− and ζi+ are the relaxation factors describing the loss of a sample's deviation from the regression interval, and the constant C is the penalty applied to each relaxation factor, determining the punishment for a sample whose SVR estimation error is larger than ε. We call the samples deviating far from the regression interval outliers. However, a fixed penalty factor C makes the regression function very sensitive to the outliers, leading to the over-fitting of SVR shown in Fig. 1, which demonstrates that the regression interval moves towards the outliers as outliers are added, thus affecting the accuracy of SVR.
where |Nk(p)| denotes the number of k-distance neighbors of sample p. Sample p's local reachable density is thus calculated as the number of its k-distance neighbors divided by the sum of the reachable distances of p with respect to each of its k-distance neighbors; in practice it is the inverse of the average reachable distance between p and its k-distance neighbors.
Definition 5 (local outlier factor LOF). LOF is used to measure the isolation
degree of a sample. The LOF of sample p is the ratio between the average LRD
of p’s k-distance neighbors and p’s LRD, which is represented in the following
formula:
  lof_k(p) = \frac{1}{|N_k(p)|} \sum_{o ∈ N_k(p)} \frac{lrd_k(o)}{lrd_k(p)}    (8)
If lofk(p) is close to 1, sample p has a good relation with its k-distance neighbors: a very similar distribution, easily merged into the same class. If lofk(p) is greater than 1, sample p has a higher degree of isolation than its k-distance neighbors. The more isolated a sample is, the more likely it is an outlier. The concept of membership degree describes the degree to which an object belongs to a class. From the above analysis, it is easy to conclude that a sample's membership degree is inversely proportional to its degree of isolation.
Definition 6 (LOF based membership function). The membership function
based on LOF is defined as follows:
  μ_i =
    (1 − θ)^m + σ,    if lof < lof_k(p) ≤ lof_max
    1 − θ,            if lof_min ≤ lof_k(p) ≤ lof        (9)
where θ = (lofk (p) − lofmin )/(lofmax − lofmin ), lofk (p) is the LOF of sample p;
lofmin , lofmax and lof are the minimum, maximum and average values among
LOFs of all the samples, respectively; σ < 1 is a very small positive real number and m ≥ 2. The LOF-based membership function is shown in Fig. 3, which demonstrates that the larger a sample's LOF value is, the more rapidly its membership declines. If lofk(p) is close to lofmax, sample p's membership will be close to the very small positive number σ, indicating that the higher the isolation degree of a sample, the less impact it has on the regression model.
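A minimal sketch of the membership function of Eq. (9), reading "(1 − θ)m" as a power; the LOF values could come from any LOF implementation, e.g. scikit-learn's LocalOutlierFactor with its negative_outlier_factor_ negated.

import numpy as np

def lof_memberships(lof_values, m=10, sigma=1e-3):
    lof = np.asarray(lof_values, dtype=float)
    lof_min, lof_max, lof_avg = lof.min(), lof.max(), lof.mean()
    theta = (lof - lof_min) / (lof_max - lof_min)
    # Samples more isolated than average get a rapidly decaying membership.
    return np.where(lof > lof_avg, (1.0 - theta) ** m + sigma, 1.0 - theta)

print(lof_memberships([1.0, 1.05, 1.1, 3.5]))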
The distance-, density- and affinity-based outlier factors, and the LOF, of the samples in the airport noise data are calculated, respectively. For each algorithm, the samples whose outlier factor values are among the top 100 are determined, and the number of real noisy samples among these 100 samples is used to evaluate the outlier identification ability of the algorithm. Table 1 shows the numbers of true outliers identified by these algorithms. It can be seen that the LOF algorithm has better outlier identification ability than the other three algorithms, because LOF uses the concept of k-distance neighbors and can better identify the outliers lying on the class boundary. Thus, we believe that, using the LOF-based fuzzy membership function, the FSVR algorithm will obtain better prediction accuracy.
  K(x, x_i) = exp{ −\frac{|x − x_i|^2}{σ^2} }    (10)
For all the algorithms, 10-fold cross validation is adopted to search for the best values of the parameters C and σ, and the membership function parameter m is set to 10. Figures 5 and 6 display the prediction results of ε-SVR and LOF-FSVR, respectively; the root mean squared errors (RMSEs, calculated according to formula (11)) of all the algorithms are listed in Table 2.
  RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} (f(x_i) − y_i)^2 }    (11)
It can be seen from Figs. 5 and 6 that, as outliers are added, the regression interval of ε-SVR moves towards the outliers, which indicates that ε-SVR is very sensitive to the outliers and its prediction ability is greatly reduced. LOF-FSVR, however, can reduce the impact of the outliers to some extent and
  Algorithm   ε-SVR without outliers   ε-SVR with outliers   density-based FSVR   affinity-based FSVR   LOF-based FSVR
  RMSE (dB)   8.0771                   8.4151                7.8258               7.6790                7.4577
has better outlier immunity and more stable performance. From Table 2 we can learn that, with noisy samples added, the accuracy of all the FSVR algorithms is significantly better than that of the ε-SVR algorithms, indicating that the immunity of the FSVR algorithms is improved by the incorporation of fuzzy membership. In addition, compared with the density-based FSVR, the proposed LOF-FSVR better describes the degree of the samples' deviation from the regression interval and therefore has better generalization performance.
A single prediction model may be unstable and not accurate enough in the prediction phase. As ensemble learning [14,15] has been proved to be an effective way of improving the performance of single models [16], we employ ensemble learning to enhance the prediction capacity of a single LOF-FSVR model.
In the combination phase of ensemble learning, the traditional method is to combine all the generated models, which results in high time and space complexity and slows down prediction. In the past decade, selective ensemble learning [28,29] has been proposed, which heuristically selects a subset of the generated models and discards the inaccurate and redundant ones. Theoretical and empirical studies [29] have demonstrated that selective ensemble can reduce
the ensemble size while maintaining or even improving the prediction performance. Therefore, in this study, selective ensemble is utilized to further improve the prediction performance. In our prediction model, we first adopt AdaBoost [18] to generate many diverse LOF-FSVR models, then adopt the simple and quick Orientation Ordering (OO) [29] selection approach to prune the generated models, and finally obtain an ensemble prediction model. The concrete steps of our prediction model are described in Algorithm 1 and the general process is displayed in Fig. 7.
• Input
training set S = {(x_1, y_1), · · · , (x_N, y_N)}, the maximum number of iterations T, and the initial weights of the training samples: w_i^1 = 1/N (i = 1, 2, · · · , N);
• Ensemble generation phase
Step 1: resample the training set S using AdaBoost according to its current weight distribution {w_i^t}_{i=1}^{N} and obtain the sampled training set S_t, where t (1 ≤ t ≤ T) denotes the t-th iteration;
Step 2: calculate the fuzzy membership μ_i of each sample (x_i, y_i) (i = 1, 2, · · · , N); the proposed LOF-FSVR is applied to the training set S_t to train a model h_t, and h_t is then used to predict the original training set S, yielding the prediction results f_t(x_i) (i = 1, 2, · · · , N);
Step 3: calculate the loss l_i = 1 − exp{−|f_t(x_i) − y_i|/D} on each sample, where D = max_{i=1,···,N} |f_t(x_i) − y_i|, and the average loss of the model L_t = \sum_{i=1}^{N} l_i w_i^t. If L_t ≥ 0.5, go to Step 2;
Step 4: set a threshold λ for the loss value l_i, and assume that

  c_t(i) =
    1,     if l_i ≤ λ
    −1,    if l_i > λ
• Prediction phase
Step 10: the selected S models are combined as the final ensemble prediction model ES = {h_1, · · · , h_S}. ES predicts a new sample x in this manner: rank these S models in decreasing order according to their prediction results Y = {g_1(x), g_2(x), · · · , g_S(x)}; the corresponding weights of the reordered models are {w_{h_1}, w_{h_2}, · · · , w_{h_S}}, and the final output of ES is:
  G(x) = inf{ y ∈ Y : \sum_{t: g_t(x) ≤ y} w_{h_t} ≥ \frac{1}{2} \sum_t w_{h_t} }    (12)
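A minimal sketch of the generation and prediction phases: a plain scikit-learn SVR with per-sample weights stands in for LOF-FSVR, the steps between Step 4 and Step 10 (not reproduced above) are replaced by a standard AdaBoost.R2-style weight update, and the Orientation Ordering pruning is omitted.

import numpy as np
from sklearn.svm import SVR

def train_ensemble(X, y, memberships, T=10, seed=0):
    rng = np.random.default_rng(seed)
    memberships = np.asarray(memberships, dtype=float)
    N = len(y)
    w = np.full(N, 1.0 / N)                         # initial sample weights (Input)
    models, model_weights = [], []
    for _ in range(T):
        idx = rng.choice(N, size=N, p=w)            # Step 1: resample by weights
        h = SVR().fit(X[idx], y[idx], sample_weight=memberships[idx])   # Step 2 stand-in
        pred = h.predict(X)
        D = max(np.abs(pred - y).max(), 1e-12)
        loss = 1.0 - np.exp(-np.abs(pred - y) / D)  # Step 3
        L = float(np.sum(loss * w))
        if L >= 0.5:
            continue
        beta = L / (1.0 - L)                        # AdaBoost.R2-style update (assumption)
        w = w * beta ** (1.0 - loss)
        w = w / w.sum()
        models.append(h)
        model_weights.append(np.log(1.0 / beta))
    return models, np.array(model_weights)

def predict(models, model_weights, x):
    preds = np.array([m.predict(x.reshape(1, -1))[0] for m in models])
    order = np.argsort(preds)                       # weighted median of Eq. (12)
    cum = np.cumsum(model_weights[order])
    return preds[order][np.searchsorted(cum, 0.5 * cum[-1])]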
From Fig. 8 and Table 3 we can learn that the proposed ensemble prediction model improves the prediction performance of a single LOF-FSVR to a large extent, and the prediction error of each monitoring node is controlled within 6–7 dB. In the model combination phase, since we adopt the quick Orientation Ordering method, the base models contained in the final ensemble account for only 30 % to 50 % of the originally generated models, indicating that the proposed prediction model greatly improves both the prediction efficiency and the performance.
References
1. Lei, B., Yang, X., Yang, J.: Noise prediction and control of Pudong international
airport expansion project. Environ. Monit. Assess. 151(1–4), 1–8 (2009)
2. David, W.F., John, G., Dipardo, J.: Review of integrated noise model (INM) equa-
tions and processes. Technical report, Washington University Forum (2003)
3. Wasmer, C.: The Data Entry Component for the Noisemap Suite of Aircraft Noise
Models. http://wasmerconsulting.com/baseops.htm
4. GmdbH, B.: User’s Manual SoundPLAN. Shelton, USA (2003)
5. Asensio, C., Ruiz, M.: Estimation of directivity and sound power levels emitted
by aircrafts during taxiing for outdoor noise prediction purpose. Appl. Acoust.
68(10), 1263–1279 (2007)
6. Yang, Y., Hinde C., Gillingwater, D.: Airport noise simulation using neural net-
works. In: IEEE International Joint Conference on Neural Networks, pp. 1917–
1923. IEEE Press, Piscataway (2008)
7. Van Den Berg, F.: Criteria for wind farm noise: Lmax and Lden. J. Acoust. Soc.
Am. 123(5), 4043–4048 (2008)
8. Makarewicz, R., Besnardb, F., Doisyc, S.: Road traffic noise prediction based on
speed-flow diagram. Appl. Acoust. 72(4), 190–195 (2011)
9. Yin, Z.Y.: Study on traffic noise prediction based on L-M neural network. Environ.
Monit. China 25(4), 84–187 (2009)
10. Wen, D.Q., Wang, J.D., Zhang, X.: Prediction model of airport-noise time series
based on GM-LSSVR (in Chinese). Comput. Sci. 40(9), 198–220 (2013)
11. Xu, T., Xie, J.W., Yang, G.Q.: Airport noise data mining method based on hier-
archical clustering. J. Nanjing Univ. Astronaut. Aeronaut. 45(5), 715–721 (2013)
12. Chen, H.Y., Sun, B., Wang, J.D.: An interaction prediction model of monitoring
node based on observational learning. Int. J. Inf. Electron. Eng. 5(4), 259–264
(2014)
13. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, New York (2000)
14. Zhou, Z.H.: Ensemble Methods: Foundations and Algorithms. CRC Press, Florida
(2012)
15. Sun, B., Wang, J.D., Chen, H.Y., Wang, Y.T.: Diversity measures in ensemble
learning (in Chinese). Control and Decis. 29(3), 385–395 (2014)
16. Kuncheva, L.I., Whitaker, C.J.: Measures of diversity in classifier ensembles and
their relationship with the ensemble accuracy. Mach. Learn. 51(2), 181–207 (2003)
17. Breiman, L.: Bagging predictors. Mach. Learn. 24(2), 123–140 (1996)
18. Freund, Y., Schapire, R.E., Tang, W.: A decision-theoretic generalization of online
learning and an application to boosting. J. Comput. Syst. Sci. 55(1), 119–139
(1997)
19. Melville, P., Mooney, R.J.: Creating diversity in ensembles using artificial data.
Inf. Fusion 6(1), 99–111 (2005)
20. Sun, B., Chen, H.Y., Wang, J.D.: An empirical margin explanation for the effec-
tiveness of DECORATE ensemble learning algorithm. Knowl. Based Syst. 78(1),
1–12 (2015)
21. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
22. Lin, C.F., Wang, S.D.: Fuzzy support vector machines. IEEE Trans. Neural Netw.
13(2), 464–471 (2002)
23. Smola, A.J., Scholkopf, B.: A tutorial on support vector regression. Stat. Comput.
14(3), 199–222 (2004)
24. Gu, B., Wang, J.D.: Class of methods for calculating the threshold of local outlier
factor. J. Chinese Comput. Syst. 29(12), 2254–2257 (2008)
25. Breunig, M., Kriegel, H.P., Ng, R.T.: LOF: identifying density-based local outliers.
In: Proceedings of the ACM SIGMOD International Conference on Management
of Data, pp. 93–104. Assoc Computing Machinery, New York (2000)
26. An, J.L., Wang, Z.O., Ma, Z.P.: Fuzzy support vector machine based on density.
J. Tianjin Univ. 37(6), 544–548 (2004)
27. Zhang, X., Xiao, X.L., Xu, G.Y.: Fuzzy support vector machine based on affinity
among samples. J. Softw. 17(5), 951–958 (2006)
28. Guo, L., Boukir, S.: Margin-based ordered aggregation for ensemble pruning. Pat-
tern Recogn. Lett. 34(6), 603–609 (2013)
29. Martinez-Munoz, G., Suarez, A.: Pruning in ordered bagging ensembles. In: Pro-
ceedings of the 23rd International Conference on Machine Learning, pp. 609–616.
ACM Press, New York (2006)
30. UCI Machine Learning Repository. http://www.ics.uci.edu/∼mlearn/mlrepository.
html
31. Libsvm: A Library for Support Vector Machines. http://www.csie.ntu.edu.tw/
∼cjlin/libsvmtools/datasets/
Application of Semantic-Based Laplacian
Eigenmaps Method in 3D Model Classification
and Retrieval
1 Introduction
With the development of computer networks and the universal application of 3D software, a large number of 3D models have been produced and widely disseminated [1, 2], so that 3D model retrieval technology is becoming a hot topic in the field of multimedia information retrieval [3–5]. Content-based retrieval (CBIR) is the main approach in 3D model retrieval; it usually requires extracting shape features of the 3D models, and these features are usually high-dimensional vectors. If each feature vector is regarded as a point in a high-dimensional space, 3D model retrieval can be converted into the problem of finding the nearest points to a query point in that space. However, for a large-scale 3D model database, the distance calculation and storage in a high-dimensional vector space require very high computational and space complexity. For this reason, retrieval efficiency is generally improved by indexing techniques or dimension reduction techniques. A dimension reduction technique maps a set of points from a high-dimensional space to a low-dimensional space. This mapping generally requires keeping the original structure
unchanged while removing redundant information, thus reducing the computational and storage complexity and improving the recognition speed of 3D models.
The traditional CBIR method uses only the physical information of 3D models and does not consider their semantic information. Because of the semantic gap [6], the retrieval results of traditional methods are generally similar in shape, but their semantic meaning is not necessarily identical. In this paper, a Semantic-based Laplacian Eigenmap (SBLE) method is proposed. We use the semantically nearest neighbors in a heterogeneous semantic network instead of the distance-based nearest neighbors in the 3D model feature space, and embed the semantic relations of the semantic space into the low-dimensional feature space by feature mapping. The method retains a large amount of semantic information of the 3D models during dimension reduction. Experimental results show the effectiveness of the proposed method for 3D model classification and retrieval on the Princeton Shape Benchmark [7].
2 Laplacian Eigenmap
Laplacian Eigenmap [8] is an effective manifold learning algorithm. Its basic idea is that points that are close in the high-dimensional space should remain close after being projected into the low-dimensional space. Suppose W is the similarity matrix. Eigenvalue decomposition of the Laplacian matrix L = D − W gives:
  Ly = λDy    (1)
where Dii = ∑jWji is a diagonal weight matrix. The objective function of Laplacian
Eigenmap is
  Y_opt = arg min_Y \sum_{i,j} ||Y_i − Y_j||^2 w_{ij} = tr(Y^T L Y)    (2)
  Y^T D Y = I    (3)
where y1, …, yr are the eigenvectors corresponding to the r smallest eigenvalues of Eq. (1). The i-th row of Y = [y1, …, yr] is the new coordinate of point i.
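A minimal sketch of the embedding defined by Eqs. (1)–(3), using a simple 0/1 k-nearest-neighbor similarity matrix; the parameter names are ours.

import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmaps(X, r=20, k=12):
    n = len(X)
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)   # pairwise squared distances
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(d2[i])[1:k + 1]] = 1.0          # simple 0/1 k-NN weights
    W = np.maximum(W, W.T)                              # symmetrize the similarity matrix
    D = np.diag(W.sum(axis=1))
    vals, vecs = eigh(D - W, D)                         # generalized problem L y = lambda D y
    return vecs[:, 1:r + 1]                             # drop the trivial constant eigenvector

Y = laplacian_eigenmaps(np.random.rand(907, 438), r=20) # e.g. 907 models, 438-dim features
print(Y.shape)                                          # (907, 20)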
The Laplacian Eigenmap method effectively keeps the distance relationships of the data in the original space and is often used in various kinds of multimedia information retrieval with high-dimensional data, but the method is also subject to the restriction of the "semantic gap": the data cannot carry semantic information through the dimensionality reduction process. A semantic-based Laplacian Eigenmap method is therefore proposed.
(Figure: the heterogeneous semantic network, in which 3D models are linked through human classification, relevance feedback, and relations such as "include", "had" and "same as".)
The figure includes part of the semantic annotation information of the 3D models (text descriptions), the relevance feedback information, the manual classification information, and the semantic ontology reasoning results (red lines), etc.
Here we discuss several ways of expressing semantics.
In the heterogeneous semantic network, 3D models with the same annotation can be seen as belonging to the same category; for 3D models annotated with ontology relationships, we only consider the "same as" attribute of a model, namely the "similar" relationship. Therefore, the semantic information from classification, annotation and ontology is defined as follows:
Definition 1. 3D model database M = {m0, m1, ……, mn}. If mi and mj belong to the same category, mi is directly semantically correlated with mj, and the number of correlations between mi and mj is termed SemI(mi, mj).
We can use the different kinds of semantic information to construct the corresponding semantic correlation matrix. The semantic correlation matrix shares many properties with the Unified Relationship Matrix [10].
According to the matrix Dsemantic, for any model mi, (Di1, Di2, …, DiN) depicts the semantic correlation times between mi and the other models and can be used as the semantic properties of the model mi. The semantic distance over Dsemantic is defined as follows:
  Dis_sem(m_k, m_p) = \sqrt{ \sum_{i=1}^{N} (D_{ki} − D_{pi})^2 }    (4)
  \frac{1}{2} \sum_{i=1}^{N} \sum_{j ≠ i} ||y_i − y_j||^2 W_{ij} = tr(Y^T L Y)
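A minimal sketch of the SBLE neighbor selection: the semantically nearest neighbors under the distance of Eq. (4), computed from the semantic correlation matrix, define the weight matrix W, which is then embedded exactly as in the Laplacian Eigenmap sketch above.

import numpy as np
from scipy.linalg import eigh

def sble_embedding(D_semantic, r=20, k=12):
    D_semantic = np.asarray(D_semantic, dtype=float)
    n = len(D_semantic)
    sq = (D_semantic ** 2).sum(axis=1)
    dis_sem = sq[:, None] + sq[None, :] - 2.0 * (D_semantic @ D_semantic.T)  # squared Eq. (4)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(dis_sem[i])[1:k + 1]] = 1.0   # semantically nearest neighbors
    W = np.maximum(W, W.T)
    D = np.diag(W.sum(axis=1))
    vals, vecs = eigh(D - W, D)                       # same generalized eigenproblem as before
    return vecs[:, 1:r + 1]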
Using the SBLE method for 3D model retrieval requires the support of a content-based 3D model retrieval system. The new system also needs the original 3D model database and the new database obtained after dimension reduction with SBLE.
The retrieval process is as follows:
(1) The user provides a target model to be retrieved. The system first determines whether the 3D model is in the original database; if it exists, the system uses the low-dimensional feature vector of the model as the new retrieval object and retrieves the target models directly in the database reduced with SBLE.
(2) If the target 3D model is not in the database, the system extracts the features of the 3D model, uses the traditional content-based method to retrieve models from the database, returns the retrieval results to the user for relevance feedback, and then uses the SBLE method to reduce the dimension of the models in the database.
(3) Finally, we use the low-dimensional feature vectors as the search target to retrieve in the database reduced with SBLE.
We choose the top 907 3D models in the PSB library as our experimental data set. There are 65 semantic classes, and feature extraction uses the depth-buffer method [11]. The dimensionality of the 3D model features is 438. The classification standard adopts the basic manual classification of the PSB.
For semantic expression, we use relevance feedback, manual semantic annotation (less than 5 %) and the ontology reasoning method [9] to obtain the new semantic annotation information, and then construct a 3D model semantic relationship matrix. In the relevance feedback process, in order to obtain the semantic relations of the 3D models, we asked five people to participate in the test; the goal was to retrieve the top 907 models in the PSB library randomly, based on the content retrieval method, and to feed back the semantically related models.
Figure 2 shows the visualization of the final semantic relationship matrix.
  Entropy = \sum_{i=1}^{k} \frac{n_i}{N} ( −\frac{1}{\log q} \sum_{j=1}^{q} \frac{n_i^j}{n_i} \log \frac{n_i^j}{n_i} )    (5)
  Purity = \frac{1}{N} \sum_{i=1}^{k} \max_j (n_i^j)    (6)
Where q and k are the numbers of manual classes and clusters, respectively, and n_i^j is the number of data objects that belong to the j-th class of the original data set but appear in the i-th cluster. The smaller the Entropy value and the larger the Purity value, the better the clustering performance.
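A minimal sketch of the two measures of Eqs. (5)–(6), given integer cluster labels and manual class labels of equal length (it assumes q > 1):

import numpy as np

def entropy_purity(clusters, classes):
    clusters, classes = np.asarray(clusters), np.asarray(classes)
    N = len(clusters)
    q = len(np.unique(classes))
    entropy, purity = 0.0, 0.0
    for c in np.unique(clusters):
        members = classes[clusters == c]          # class labels of the objects in cluster c
        n_i = len(members)
        counts = np.array([np.sum(members == j) for j in np.unique(classes)])
        p = counts[counts > 0] / n_i
        entropy += (n_i / N) * (-(p * np.log(p)).sum() / np.log(q))
        purity += counts.max() / N
    return entropy, purity

print(entropy_purity([0, 0, 1, 1, 1], [0, 0, 0, 1, 1]))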
The X-means [13] clustering algorithm is used to cluster the original shape features in the experiment, with the number of clusters set in the range 60–70. Table 1 shows the comparison of the clustering evaluation results between clustering the original features with X-means and the SBLE algorithm.
From Table 1 we can see that the results of the proposed method are significantly better than the results of clustering the original features.
O907 represents the retrieval efficiency of the original 3D model features under the Euclidean distance measure; LE20N12 and LE20N7 represent the retrieval efficiency of the classic LE method when the new features have 20 dimensions and the number of nearest neighbors (NN) is 12 and 7, respectively; SBLE20N12 and SBLE20N7 represent the retrieval efficiency of the proposed SBLE when the new features have 20 dimensions and the number of semantically nearest neighbors (NN) is 12 and 7, respectively.
Figure 3 shows that the retrieval efficiency of the proposed method is better than that of the other two methods.
(Figure: precision–recall curves of LE20N12, SBLE20N12 and O907 for individual model classes, with panels (1) ant, (2) bird, and (4) watch.)
6 Conclusion
References
1. Xu, B., Chang, W., Sheffer, A., et al.: True2Form: 3D curve networks from 2D sketches via
selective regularization. ACM Trans. Graph. (Proc. SIGGRAPH 2014) 33(4), 1–13 (2014)
2. Wang, Y., Feng, J., Wu, Z., Wang, J., Chang, S.-F.: From low-cost depth sensors to CAD:
cross-domain 3D shape retrieval via regression tree fields. In: Fleet, D., Pajdla, T., Schiele,
B., Tuytelaars, T. (eds.) ECCV 2014, Part I. LNCS, vol. 8689, pp. 489–504. Springer,
Heidelberg (2014)
3. Thomas, F., Patrick, M., Misha, K., et al.: A search engine for 3D models. ACM Trans.
Graph. J. 22(1), 85–105 (2003)
4. Luo, S., Li, L., Wang, J.: Research on key technologies of 3D model retrieval based on
semantic. J. Phy. Procedia 25, 1445–1450 (2012)
5. Mangai, M.A., Gounden, N.A.: Subspace-based clustering and retrieval of 3-D objects.
J. Comput. Electr. Eng. 39(3), 809–817 (2013)
6. Zhao, R., Grosky, W.I.: Negotiating the semantic gap: from feature maps to semantic
landscapes. J. Pattern Recogn. 35(3), 51–58 (2002)
7. Shilane, P., et al.: The Princeton shape benchmark. In: Shape Modeling International,
pp. 388–399 (2004)
8. Belkin, M., Niyogi, P.: Laplacian eigenmaps for dimensionality reduction and data
representation. Neural Comput. 15(6), 1373–1396 (2003)
9. Wang, X.-y., Lv, T.-y., Wang, S.-s., Wang, Z.-x.: An ontology and SWRL based 3D model
retrieval system. In: Li, H., Liu, T., Ma, W.-Y., Sakai, T., Wong, K.-F., Zhou, G. (eds.)
AIRS 2008. LNCS, vol. 4993, pp. 335–344. Springer, Heidelberg (2008)
10. Xi, W., Fox, E.A., Zhang, B., et al.: SimFusion: measuring similarity using unified
relationship matrix. In Proceedings of the 28th Annual International ACM SIGIR
Conference on Research and Development in Information Retrieval (SIGIR 2005),
Salvador, Brazil, pp. 130–137 (2005)
11. Heczko, M., Keim, D., Saupe, D., Vranic, D.V.: Methods for similarity search on 3D
databases. J. German Datenbank-Spektrum 2(2), 54–63 (2002)
12. Zhao, Y., Karypis, G.: Criterion functions for document clustering: experiment and analysis.
Technical report, University of Minnesota, pp. 1–40 (2001)
13. Pelleg, D., Moore, A.: X-means: Extending K-means with efficient estimation of the number
of clusters. In: Proceedings of the 17th International Conference on Machine Learning (ICML 2000), Stanford University, pp. 89–97 (2000)
Author Index