
BIG DATA PROBLEMS IN CYBERSECURITY

Cybersecurity in the Era of Data Science: Examining New Adversarial Models
Bülent Yener | Rensselaer Polytechnic Institute
Tsvi Gal | Morgan Stanley

The ever-increasing volume, variety, and velocity of threats dictate a big data problem in cybersecurity and necessitate the deployment of AI and machine-learning (ML) algorithms. The limitations and vulnerabilities of AI/ML systems, combined with the complexity of data, introduce a new adversarial model, which is defined and discussed in this article.

Data science provides tools for synthesizing large amounts of data very quickly, improving running hypotheses (in supervised and reinforced methods) as well as identifying unforeseen patterns (so-called unknown unknowns in the unsupervised method). It enables scientists to extract knowledge, identify patterns, and build predictive models from complex data characterized by 1) very large volumes (terabytes to petabytes), 2) fast arrival/generation rate (velocity) by multiple heterogeneous and noisy (e.g., dirty, not useful, or misleading) data sources, and 3) different modalities (variety). Recently, new threats, such as ransomware, introduced another feature of complex data: their ever-increasing value.

Various AI/ML techniques have been successfully deployed in cyberdefense for access control, traffic analysis, spam detection, anomaly detection, and intrusion detection. User and entity behavioral analytic techniques analyze behavior logs and network traffic in real time to respond to an attack.

All of these systems are based on either training ML algorithms on labeled data for supervised learning, deploying unsupervised techniques to cluster unlabeled data, or using some combination of labeled and unlabeled data for semisupervised learning tasks. As modern AI and ML move from rule-based, limited data systems to massive (unstructured) data models, new techniques, such as generative adversarial networks (GANs), are moving into the mainstream of tools being used. The term AI refers to the discipline that includes knowledge representation, natural language processing (NLP), planning, ontologies, and ML. Thus, although ML is the focus of this article, it is a subarea of AI.

As we discuss in the "Trustworthiness of Data" section, solutions must borrow from other areas in AI, such as ontologies and semantics. AI/ML algorithms can be used for clustering, classifying, or predicting events of interest. For example, various classifiers, including deep neural networks (DNNs), are widely used in 1) distinguishing malware from benign programs and 2) classifying malware to one of the known families.1 However, these sophisticated algorithms have fundamental limitations, which can be intentionally abused to misclassify a malware, destroy the reputation of a company, evade an anomaly detection system, and even attack the underlying cryptographic systems. Thus, in this era of "big data," when we rely more and more on AI/ML techniques, a new adversarial model emerges with new threat models that must be understood, along with a discussion about how to counter it.
Challenges Resulting From Big Data
Training AI/ML algorithms and building models
for complex data are a challenge for several reasons,
including 1) noisy (from filtering) data streams, 2) data coming from multiple sensors (indicating a data fusion problem), and 3) data having different modalities (e.g., multiway models). Learning from multiple heterogeneous data streams requires multivariate models and computing complex distributions (e.g., prior probability distributions or density functions). In many cases, the exact computation of these distributions is not feasible (due to unbounded sums, infinite integrals, and so on); thus, AI/ML can provide only approximate solutions, which are found in some local optimum. The implications of these limitations are not well understood for the cyberdefense systems that rely on automation and AI/ML techniques. Some specific challenges due to the nature of big data are mentioned in the following sections.

High-Volume Threats Without Robust Attribution
The pervasive and ubiquitous nature of the Internet makes distributed and large-scale attacks [e.g., distributed denial-of-service (DDoS) attacks] very effective tools for an adversary.3 Although these threats are known to the cyberdefense community, there are two new factors: 1) know-how is not limited, and sophisticated attackers are easily found in the cyberworld [e.g., Internet Relay Chats (IRCs)] and 2) underground economies in the darknet provide overwhelming computing power. Because it is possible to buy malicious software, ranging from packers to zero-day malware, and rent botnets for less than US$1 per machine in underground markets, deciding the attribution for an attack is very difficult. For example, a 16-year-old who learned how to implement DDoS attacks from experts used available code from an IRC and, wielding a botnet, launched a DDoS attack against Yahoo, eBay, and Amazon, causing an estimated US$1.2 billion in damages (see https://en.wikipedia.org/wiki/MafiaBoy).

Technologies such as botnets, digital cash, and anonymous communications (e.g., Tor) can be abused by virtually any entity (person, organization, or state) that can buy malware, rent botnets, and embark on a cyberattack. For example, in Figure 1 we show how readily available it is to rent or buy tools for a cyberattack. Thus, we must consider sophisticated defense systems that use AI/ML techniques to enable the automatic detection and correlation of traffic flows as well as build hypotheses for predicting threats (potentially against an adversary that may also use AI/ML techniques to evade our systems).

Figure 1. A screenshot of available botnets in an underground market where digital cash could be spent.

High-Velocity Data Analysis Is Complicated
In the analysis of big data, velocity may enforce a limitation on how many data can be processed in real time (or near real time) and may require working with partial information to compute a function on the data. The need for a timely response (interactive applications) and the limitations of compute resources (scalability) prohibit generating exact results. One approach used in data science is called sketching, which produces approximate results that are orders of magnitude faster and have provable error bounds. However, sketches may capture only limited temporal states (sliding window model), create artifacts (graph sparsification), and may be limited (e.g., computing entropy for detecting outliers). The smart adversary model can exploit these limitations and evade defense systems, which rely on real-time processing of data.
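To make the idea of sketching concrete, the short Python sketch below maintains approximate per-flow counts with a count-min sketch; the width, depth, and hash construction are illustrative choices rather than recommendations, and a production system would also age out old counts (e.g., a sliding window).

# Minimal count-min sketch: estimate per-flow packet counts in bounded memory.
# Width, depth, and the MD5-based hashing are illustrative choices.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Never underestimates; overestimates are bounded with high probability.
        return min(self.table[row][col] for row, col in self._buckets(key))

sketch = CountMinSketch()
for flow in ["10.0.0.5->80", "10.0.0.5->80", "192.168.1.9->443"]:
    sketch.add(flow)
print(sketch.estimate("10.0.0.5->80"))  # approximately 2

The estimate never undercounts, and its overestimate shrinks as the width grows, which is the kind of provable error bound referred to above; the tradeoff is that, as noted, such summaries retain only limited temporal state.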
The essence of the problem is the difficulty of learning patterns and making predictions from fast-moving time-series data. For certain applications, simple statistical techniques such as a moving average computed in a small window size (e.g., in milliseconds) can be a useful indicator, but in cybersecurity applications, such averaging techniques may not be as useful. The patterns of interest can be detected by a combination of multiple features in a high-dimensional space. For example, (near) real-time anomaly detection and outlier identification problems require extracting a high-dimensional feature vector from the data streams, which can be used to identify patterns by learning algorithms. Similarly, analyzing incoming http traffic streams for detecting phishing attacks may require examining the entire page and maintaining a much larger window of operation.

A particularly challenging problem in high-velocity data streams is detecting phase transitions. This is also known as the change point detection (CPD) problem and concerns estimating probability distributions over a high-dimensional space created by latent features that control the observed data. A smart adversary equipped with adversarial input can modify the shapes of the
learned distributions (e.g., flatter and overlapping distributions) to evade CPD systems. Implicit in these examples is the multiplicity of data with respect to different sources and modalities, which require multivariate treatment of data. There are well-known scalability and feasibility challenges that extend univariate learning models to multivariate ones. Addressing the CPD problem in high-velocity, multivariate data streams is a particularly difficult one.
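As a minimal, univariate illustration of CPD, the toy CUSUM detector below flags a shift in the mean of a stream; the target mean, drift, and threshold are arbitrary assumptions, and a realistic detector would have to operate on multivariate, high-dimensional features as discussed above.

# Toy one-sided CUSUM detector for a shift in the mean of a univariate stream.
# Target mean, drift, and threshold are illustrative values only.
def cusum(stream, target_mean, drift=0.5, threshold=5.0):
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - target_mean - drift))
        if s > threshold:
            return t  # index at which a change point is declared
    return None

baseline = [1.0, 0.8, 1.2, 0.9, 1.1, 1.0]
shifted = [3.1, 2.9, 3.3, 3.0]  # a distribution shift an adversary may try to mask
print(cusum(baseline + shifted, target_mean=1.0))

A smart adversary who knows the drift and threshold can spread the shift over many small steps so that the cumulative statistic never crosses the threshold, which is exactly the evasion concern raised here.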
Because modern attacks take a long-term view of compromising their target, they are no longer single attacks or even a concentrated barrage attack; instead, they are continuous, sophisticated, and mostly undetected. They often aim to infiltrate an entire network, with the goal of remaining sustainable once they succeed. Advanced persistent threats (APTs) move from infiltration to expansion and extraction and are able to cover their tracks. APT attacks differ from traditional web application threats in that they are more complex, target an entire network, and, most importantly, are persistent. We will revisit this issue in the "A New Threat Model: Smart Adversary" section, where we discuss insider threats.

The continuous nature mentioned previously means that once an environment is compromised, the adversary stays put to acquire as much information as possible. These attacks can take a variety of forms, such as an SQL injection, remote file inclusion, and cross-site scripting. Their persistent nature makes these attacks especially difficult to detect and fight with tools such as defense in depth, firewalls, and antivirus (AV) programs.

A New Threat Model: Smart Adversary
We define a new adversarial model: the smart adversary, in which an adversary uses sophisticated techniques for defensive purposes against its targets. For example, ransomware attacks demonstrate how a sophisticated adversary can use cryptography for protection and as a weapon against its victims. Similarly, a smart adversary can abuse AI/ML techniques to beat intelligent cyberdefense systems that rely on AI/ML. Such attacks can target the training data, test data, and model parameters used by AI/ML algorithms depending on the underlying white-box (open) or black-box (closed) architecture.

The goals of a smart adversary may include eroding confidence and trust in the underlying AI/ML system (by way of introducing false negatives and false positives), targeting a particular class for misclassifications, and evading automated detection by obfuscation.

The techniques a smart adversary can deploy depend on the target system. Open-box systems are transparent; the internal states, data, and gradient calculation are openly available, while in black-box systems, only the input–output relationship is observed. For example, in black-box systems, an adversary may simulate the behavior of ML algorithms on surrogate training data sets and then perform a sensitivity analysis to understand the relationship between data instances and accuracy of the algorithms. This understanding can be used to design causative attacks on learning algorithms by poisoning the training data (e.g., poisoning attacks on support vector machines).2 On the
other hand, if the internal states of the ML algorithm targeted by an adversary are known (i.e., open-box systems), then exploratory attacks that exploit the algorithm become possible.

In both classes of attacks, the adversary carefully chooses specific data points to perturb depending on the algorithm used to create an adversarial input. Deploying an adversarial input against a cybersecurity system that uses AI/ML is similar to zero-day exploits and can have serious consequences.

Adversarial Input Exploits Instability in DNN Architectures
All ML algorithms are vulnerable to adversarial input (even the linear models), but DNNs are particularly interesting because of the nonconvexity and nonlinearity of the learned models.3,4 We can model a DNN by a multidimensional function F: X → Y, where X is a d-dimensional feature (input) vector and Y is an output vector (e.g., probability assignments to each "class" for input X). If a DNN has n layers, then we can decompose F as F(X) = Fn(Fn−1(… F1(X, θ1), …, θn−1), θn), where θi denotes the trained parameters at layer i.

An adversarial data point X′ can be constructed by adding a minimum perturbation δX so that X′ = X + δX and ‖δX‖ is minimized, such that F(X′) has a label Y′ that is different from Y, thus causing a misclassification. The inherent instability of a DNN with n layers under adversarial input is attributed to how the perturbation to input X propagates over the n layers. The impact of the perturbation can be bounded by the Lipschitz constant Li at each layer i, and, overall, the perturbation has an avalanche effect because it grows as5

‖Fi(X) − Fi(X + δX)‖ ≤ Li‖δX‖
⇒ ‖F(X) − F(X + δX)‖ ≤ ‖δX‖ · L1 L2 ⋯ Ln.
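The following numeric sketch (a hypothetical two-layer ReLU network with random weights, not a model from the article) illustrates this bound by comparing the observed output perturbation with the product of the layers' spectral norms, which play the role of the Li.

# Numeric illustration of the layer-wise Lipschitz bound for a perturbation dX.
# The two random linear layers are a hypothetical stand-in for a trained DNN.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 32)), rng.normal(size=(16, 64))
F = lambda x: W2 @ np.maximum(W1 @ x, 0.0)  # F = F2(F1(x)) with ReLU in between

x = rng.normal(size=32)
dx = 1e-3 * rng.normal(size=32)

L1 = np.linalg.norm(W1, 2)  # spectral norm = Lipschitz constant of layer 1
L2 = np.linalg.norm(W2, 2)  # ReLU itself is 1-Lipschitz
observed = np.linalg.norm(F(x + dx) - F(x))
bound = L1 * L2 * np.linalg.norm(dx)
print(observed <= bound, observed, bound)  # the product of the Li bounds the avalanche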
There is a tradeoff between adding more layers to a DNN architecture for reduced training time and its increased instability. Current architectural considerations for DNNs are focused on accuracy-performance tradeoffs and do not address the impact of adversarial input on the architecture choices, although such considerations may provide some level of defense.6

Explainable AI/ML Versus Adversarial Input
There is increased demand for ML algorithms to explain and justify how they arrive at a particular result. For example, the General Data Protection Regulation (GDPR) defines the law on data protection and privacy for all individuals who live in the European Union. According to Article 22 of the GDPR, customers must have "clear-cut reasons for how they were adversely impacted by a decision." This requirement asserts that the underlying AI/ML model explains what the parameters of the model are and how they operate to reach their decision. There are several ways AI can be explained when it is used in a risk or regulatory context. Scoring algorithms, e.g., inject noise and artificial data points around a real point and observe what features impact the score (such as local, interpretable model-agnostic explanations). Others include sensitivity analysis,7 influence curves,8 relevance propagation, and an information bottleneck that establishes a tradeoff between compression and prediction.9

There is an irony in the explainability of ML algorithms and adversarial learning: although understanding the internal operations of black-box learning algorithms (e.g., gradient based) is very desirable, the techniques used to decipher them can also be used to choose the best data points to abuse them. For example, sensitivity analysis of a DNN would show us how changes in input will impact the output; i.e., it quantifies the importance of each input variable i on the observed change of the function f(x) as Ri = |(∂/∂xi) f(x)|.

By computing the forward or backward derivatives (depending on the type of the DNN architecture), one may gain some insight into how the algorithm works. For instance, if the algorithm is open box, a smart adversary can deploy the same techniques to choose input data that will maximize the misclassification or discrimination against a subpopulation by poisoning the training set or eroding trust. If the algorithm is black box, a smart adversary can deploy transfer-learning techniques by developing a surrogate model that emulates the target algorithm. Thus, our efforts to understand ML better also create a more sophisticated adversary.
Decoys and Obfuscation as Adversarial Input
One application of adversarial data generation in the cybersecurity domain is malware obfuscation, i.e., how to modify the code to evade static malware detection algorithms while ensuring that malicious behavior remains unchanged. Obfuscation successfully defeats signature-based AV systems.10

In dynamic analysis environments, malware is executed in a sandbox and its behavior is examined. However, new-generation malware is smart enough to detect if it is being instrumented and avoids dropping its payload. It is particularly important for zero-day attacks to drop the payload only when there is no instrumentation (e.g., no sandboxing). As a result, defensive technologies (AV systems) try to obfuscate their existence and avoid detection so as to "fool" the malware into exhibiting its malicious behavior in a sandbox.10

It is not farfetched to imagine smart malware that can access a website and utilize an AI/ML system to
decide whether it is interacting with a native or instrumented environment. In this scenario, an adversary (malware) can train a binary classifier to distinguish "fake" environments from "native" ones. To counter such an adversary, defensive technologies must generate sandboxes that can fool the distinguisher (Dis). This scenario resembles a typical GAN11 model where the adversary is the Dis and the defensive technology is the generator (Gen) component. There is a zero-sum game played between the Dis and Gen, which establishes the limits of best-possible obfuscation.
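A heavily condensed version of that game is sketched below in PyTorch: a generator produces synthetic environment "fingerprints," and a discriminator tries to separate them from native-host fingerprints. The feature dimension, network sizes, and synthetic data are placeholder assumptions; the article does not prescribe an implementation.

# Minimal GAN-style sketch: a generator (defender) learns to emit sandbox
# "fingerprints" that a discriminator (the malware's classifier) cannot tell
# apart from native-host fingerprints. All dimensions and data are synthetic.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, noise_dim = 8, 4
native = torch.randn(512, dim) + 1.0  # stand-in for native-host feature vectors

G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, dim))
D = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))
loss = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    fake = G(torch.randn(64, noise_dim))
    real = native[torch.randint(0, 512, (64,))]
    # Discriminator step: label native hosts as 1 and generated sandboxes as 0.
    d_loss = loss(D(real), torch.ones(64, 1)) + loss(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step: make generated sandbox fingerprints look native to D.
    g_loss = loss(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(d_loss.item(), g_loss.item())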
Decoy attacks are common tools for obfuscation. For example, a zero-day attack can be wrapped in an easier-to-catch malware that serves as a decoy. If an AV system takes the bait and catches the decoy malware, the actual zero-day exploit will not be dropped or unpacked.

As mentioned previously, we saw the first case of ransomware being used as a decoy in late 2015. However, by the end of 2016, more targeted attack groups had adopted the tactic. One of the most high-profile examples was the Sandworm cyberespionage group, which created a new version of its destructive Disakil Trojan (Trojan.Disakil), disguised as ransomware. As with DDoS attacks, using ransomware as a decoy had a similar effect, sowing confusion among the victims and delaying an effective response. A recent example was a politically themed document circulating in the Philippines (see https://www.fortinet.com/blog/threat-research/hussarini—targeted-cyber-attack-in-the-philippines.html).16

Eroding Trust
A smart adversary can manipulate AI/ML algorithms to misclassify input data to a target class. An adversary may, e.g., manipulate the learning algorithm to maximize false-positive rates, which would disrupt normal operations and create a trust problem for the ML algorithms. The target class can be chosen to attack a particular employee, client, or subpopulation depending on the context.

A particularly powerful adversarial model that combats AI/ML systems is the insider threat model. Insider attackers operate behind firewalls in semitrusted environments because they may be legitimate employees, contractors, or collaborators with access to how AI/ML algorithms work, what kind of rule base they implement, what red flags exist, and so on. As a result, they aim to avoid raising a red flag by not violating rules. For example, knowing the operational details of how an anomaly detection system may work in his/her company, an insider attacker may divide the malicious behavior into minuscule and inconspicuous steps to conceal them in a stream of legitimate behaviors of benign users. Insider attacks are examples of complex attack vectors in which adversarial behavior is spread over time and composed of low-intensity steps to stay under the radar. ML algorithms that deploy past information, such as hidden Markov models and long short-term memory architectures, can be used to capture the behaviors of each user, and anomaly detection mechanisms can be fine-tuned at the expense of scalability.
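A toy version of such per-user modeling is shown below: a first-order Markov chain over event types is fit to a user's history, and new sequences are scored by their average log-likelihood. The event names, smoothing constant, and vocabulary size are illustrative stand-ins for the HMM/LSTM approaches mentioned above.

# Toy per-user behavior model: a first-order Markov chain over event types.
# A low average log-likelihood for a new sequence flags a possible anomaly.
# Event names, smoothing, and vocabulary size are illustrative placeholders.
import math
from collections import defaultdict

def fit(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def avg_log_likelihood(counts, seq, alpha=1.0, vocab=4):
    ll = 0.0
    for a, b in zip(seq, seq[1:]):
        total = sum(counts[a].values())
        ll += math.log((counts[a][b] + alpha) / (total + alpha * vocab))
    return ll / max(len(seq) - 1, 1)

history = [["login", "read", "read", "logout"]] * 20
model = fit(history)
print(avg_log_likelihood(model, ["login", "read", "read", "logout"]))      # high: normal
print(avg_log_likelihood(model, ["login", "export", "export", "logout"]))  # low: suspicious

An insider who scatters the rare "export" events thinly among long runs of ordinary activity pulls the score back toward normal, which is why finer-grained, per-user models (and their scalability cost) matter here.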
Trustworthiness of Data
As cyberthreats become increasingly pervasive and global, nation states, law enforcement agencies, and private organizations share potential vulnerabilities, e.g., zero-day exploits, new patterns of attacks, and the intelligence information to defend. However, such information sharing and collaboration are vulnerable to information warfare in which an adversary may inject malicious or deceptive information to poison the collaborative defense systems.

Manufacturing and disseminating malicious information (e.g., news, rumors, opinions, and malicious sensor data) have always been important tools in information warfare; however, they have recently become a serious threat to national security because many nations rely more and more on news streams from the Internet in the form of social networks, blogs, instant messages, emails, and so on.

Recent advances in AI, including ML and NLP, created interest in how these systems can be deployed to combat the adversarial and malicious data problem. Unfortunately, current state-of-the-art methods in these fields demonstrate a very modest level of success, which indicates that more research must be done in these areas. Furthermore, deep ML algorithms are themselves subject to adversarial data and can be manipulated to malfunction with high confidence, as mentioned previously. An early poisoning of mission-critical shared data can be very difficult to detect and attribute. A promising approach that creates and maintains distributed trust for shared data is adapting blockchain technologies to address many (not all) of the concerns about sharing mission-critical data. A blockchain can be created when data are generated, and all access to and modification of the data can be enforced by cryptographic contracts, thus providing accountability and auditability for forensics purposes. This phenomenon, termed universal error, suggests that an error made (or maliciously caused) early enough in the chain would be accepted as a "golden source" (i.e., as ultimate truth) and transferred as such to all downstream systems, causing the entire ecosystem to accept this erroneous data point as "fact." Additionally, given system interconnectivity and the self-correction modules within many institutions, a well-placed false fact can be duplicated and accepted by many or all participants in the network as truth, making it a universal error.

An important concept in cybersecurity is the attribution of malicious acts, ranging from DDoS attacks to a collection of malicious users' accounts in social media instigating an information warfare campaign. Furthermore, malicious accounts can be used for cybersecurity attacks, including impersonation, hacking user accounts, and spreading malware. Attribution in information warfare requires finding statistical patterns that indicate correlated and coordinated similar activities. AI/ML techniques that address attribution problems encounter similar problems faced when combating spam messages because of a lack of semantic interpretation or some level of contextual understanding. It is relatively easy to deceive an AI/ML system and cause both false positives and false negatives by introducing misinformation and noise; therefore, a smart adversary currently has the upper hand in deception.

As we move toward a data-centric world in which data are increasingly treated like commodities, it is no longer enough to think of data as spreadsheets or tables because applications demand answers to questions such as what is the origin of these data; what are the semantics and how do they change over time; how and by whom are data modified; and who owns the data. These questions are related to the "trustworthiness" (integrity, confidentiality, authenticity, and auditability) of data, and some of them can be answered by data provenance and lineage techniques. However, data provenance provides a contextual and historical record of passive data but cannot address more elaborate or algorithmic questions such as how do we verify and resolve semantic inconsistencies; what are the descriptive and predictive features of data; and what is the value of data with respect to statistical and machine-learning tasks. These are particularly important questions for ML algorithms that extract the features directly from data without human intervention or domain expertise.

It has been suggested that DNNs do not learn the semantics and high-level abstractions in the data set but learn only simple surface statistical regularities: thus, ML models generalize well without ever having to explicitly learn abstract concepts12 (perhaps this is one of the reasons why CNNs are vulnerable to adversarial examples). Finally, questions pertaining to the trustworthiness of data must be addressed.

Perhaps what is needed is a new, algorithmic and value-based perspective for data-as-a-commodity that is designed to be active and smart while keeping its provenance and lineage intact. Smart data rely on 1) data provenance and lineage information and 2) ontologies that define their context and associated semantics. OWL-based ontologies are currently being built for cybersecurity domains.17 However, smart data are aware of what algorithms, statistical tests, and abstractions can be executed while ensuring integrity, access control, confidentiality, and auditability. For example, consider a protected health information record (PHIR) that includes diagnosis, medical test results, and demographic and insurance information. A smart PHIR would allow certain fields to be accessed only by medical personnel, others only by insurance companies, and yet other fields only by data mining companies. Depending on the value of each field, a smart PHIR may deploy different fees and charging mechanisms where the integrity of each data field is enforced by digital signatures and cryptographic checksums. PHIRs can be generalized to other database applications (e.g., human resources, defense, or advanced manufacturing) where different data fields have different security and semantic properties. Smart data may not trust the database server and require an encrypted database in which operations only on the ciphertext are allowed. A database that keeps data unencrypted is vulnerable to multiple attacks.
Many security breaches of government databases have been reported in recent years. If a database is not encrypted, then data may be leaked (i.e., a breach of confidentiality) or cross-correlated (i.e., a breach of privacy). If the data are kept in the cloud, a smart adversary can potentially mishandle calculations, delete data, refuse to return results, collude with other parties, and so on. An adversary can ruin the reputation of a cloud business with such an attack as well. However, protecting the confidentiality, integrity, and privacy of data creates a tradeoff with what can be learned and mined from it.

A compromise would be to deploy capability-based access mechanisms (to specify which fields of a PHIR can be accessed by whom and what operations are allowed on these fields) along with more-advanced encryption techniques that can operate on the ciphertext. These techniques include 1) homomorphic encryption (HE), which allows some aggregate statistics to be computed on the encrypted data only (e.g., average blood pressure over the last five years); 2) functional encryption (FE), which controls the fields that can be accessed by authorized parties [in FE, given a ciphertext C(m) for a message m and a secret key for a function f, a user can learn only the value f(m)]; or 3) attribute-based encryption, in which the secret key provides an access formula that operates over a set of n attributes that must evaluate to true for decryption to yield the message.

Although the cryptographic techniques mentioned in this section are not widely used in practice due to their high complexity, their lightweight implementations are attractive, even at the expense of learning. For example, a "leveled homomorphic" scheme can evaluate only polynomial functions of the input data limited to homomorphic
addition and multiplication of the data. In that case, learning is also restricted to algorithms that can be expressed as polynomials with bounded degrees (see https://eprint.iacr.org/2012/323.pdf).
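As a small illustration of computing an aggregate statistic directly on ciphertexts, the fragment below uses the open source python-paillier package (imported as phe), assuming it is installed; Paillier supports only homomorphic addition and multiplication by plaintext scalars, so it stands in for the more general leveled schemes discussed here.

# Additively homomorphic aggregation on ciphertexts, assuming the
# python-paillier package ("phe") is available. Paillier permits adding
# encrypted values (and scaling by plaintext constants), which is enough to
# compute an aggregate such as an average without exposing individual records.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

blood_pressure = [118, 131, 125, 140]                 # plaintext stays with the client
ciphertexts = [public_key.encrypt(v) for v in blood_pressure]

encrypted_sum = sum(ciphertexts[1:], ciphertexts[0])  # the server adds ciphertexts blindly
average = private_key.decrypt(encrypted_sum) / len(blood_pressure)
print(average)  # 128.5, computed without the server ever seeing a single reading

The party that adds the ciphertexts never sees an individual value; only the key holder can decrypt the aggregate.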
We note that ML and cryptography have a duality because cryptography aims to prevent access and learning from the data, whereas ML attempts to extract knowledge from data.18 Machine-learning algorithms are used to not only attack cryptosystems but also learn from (homomorphic) encrypted data (see https://eprint.iacr.org/2012/323.pdf) and attack them. From an adversarial perspective, in case of HE, the noise terms grow, and decryption is possible only as long as these noise terms do not exceed a certain bound. A smart adversary can exploit this inherent weakness to attack cryptographic databases using adversarial inputs.

In addition to raw form, smart data support different levels of abstractions of data to be presented to applications and users. Row data, e.g., may be projected onto a subspace to eliminate noise and reduce dimensionality, and they may be presented to some applications in low rank. Thus, smart data and the algorithms using them can be coupled together to provide different value propositions to users based on their needs and demands from the data.

However, as discussed in this article, data and algorithms must be examined from an adversarial perspective. During the 1980s, the lack of consideration for security properties in the design of Internet protocols demonstrated that focusing only on performance (with respect to accuracy and efficiency) created ongoing vulnerabilities and patches. Today, we experience a similar situation occurring in the AI/ML landscape, where security properties of algorithms are not considered as a part of design and implementation. Thus, we first must train and test our algorithms by putting on black hats.

Adversarial training suggests adding malicious input to the training set of a DNN as one of the defense mechanisms against a smart adversary. In particular, training with an adversarial objective function based on the fast-gradient sign method is shown to be an effective regularizer.13 Other defense techniques against adversarial input include defensive distillation.6 It is shown that if the training set contains some adversarial input, the robustness of ML increases.14 This demands that smart data must contain some information and statistical tests15 to label their vulnerability to exploits under an adversarial model. Finally, we may need to use more than one algorithm to solve the same cybersecurity problem. ML algorithms should use different objective functions and feature vectors to minimize the impact of a smart adversary.
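A compressed example of one such adversarial training step is given below in PyTorch; the toy model, synthetic features and labels, the value of epsilon, and the equal weighting of clean and adversarial losses are illustrative assumptions, while the perturbation rule itself (stepping along the sign of the input gradient) follows the cited fast-gradient sign method.13

# Sketch of one FGSM-based adversarial training step in PyTorch.
# The toy model, synthetic data, epsilon, and loss weighting are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(128, 20)         # stand-in feature vectors (e.g., extracted from binaries)
y = torch.randint(0, 2, (128,))  # stand-in labels: benign vs. malware
epsilon = 0.1

# 1) Craft FGSM perturbations: move each input along the sign of its gradient.
x_adv = x.clone().requires_grad_(True)
loss_fn(model(x_adv), y).backward()
x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

# 2) Train on a mix of clean and adversarial examples (the adversarial objective).
opt.zero_grad()
loss = 0.5 * loss_fn(model(x), y) + 0.5 * loss_fn(model(x_adv), y)
loss.backward()
opt.step()
print(loss.item())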
Adversarial testing is different from testing correctness of algorithms because it must consider malicious/adversarial input. There is a need to include a smart adversary during the testing stage of software production because software engineers focus only on testing functionality and end-user requirement compliance. There are programming models that include the security aspect during the development stage, but they do not include a smart adversary. In cybersecurity research, we first define an adversarial model to precisely describe the capability of a hypothetical adversary and show that proposed security mechanisms (e.g., cryptographic algorithms) can protect under that model. In AI/ML, we require a similar approach and must state the vulnerability of algorithms under different adversarial models (e.g., black box only versus open systems) to quantify potential bias and abuse. Thus, the need for adversarial training and testing of ML algorithms validates the motivation for coupling smart data with the AI/ML algorithms that make use of these data.

Finally, how do we go about implementing a smart data infrastructure (SDI)? Blockchain technologies and cryptographic contracts can be building blocks for implementing the smart data concept. Blockchains for smart data can have different granularity; they can be as small as an individual's financial records or as large as a database containing user information for a social network. As data are created, a new blockchain is initiated and the data, their provenance information, and semantics are digitally signed. Each block contains semantic and provenance (initial or modification) information about "usage of data," e.g., what types of algorithms can use these data, what functions can be performed, what is the distribution, what populations the data represent, and so on. Cryptographic contracts can be used to enforce capability-based security on smart data to control "who" can do "what" and "how." Each modification can be recorded into a new block along with algorithms and access privileges. Thus, we see cryptographic contracts as the main tool for coupling smart data with their algorithms while ensuring integrity, confidentiality, privacy, and auditability.
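A minimal sketch of such a chained record is given below; the block fields, the single usage rule, and the omission of signatures and consensus are simplifying assumptions, so the example shows only the tamper-evidence idea rather than a full cryptographic contract.

# Minimal hash-chained provenance log for "smart data": each block records who
# modified the data, what they did, and the usage rule in force, and blocks are
# chained by SHA-256 digests. Signatures, consensus, and contracts are omitted.
import hashlib
import json
import time

def make_block(prev_hash, actor, action, usage_rule):
    block = {
        "prev_hash": prev_hash,
        "actor": actor,
        "action": action,
        "usage_rule": usage_rule,  # e.g., which algorithms may consume the data
        "timestamp": time.time(),
    }
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

def verify(chain):
    for prev, cur in zip(chain, chain[1:]):
        body = {k: v for k, v in cur.items() if k != "hash"}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if cur["prev_hash"] != prev["hash"] or cur["hash"] != digest:
            return False
    return True

genesis = make_block("0" * 64, "data-owner", "create", "aggregate-stats-only")
update = make_block(genesis["hash"], "analyst-17", "project-to-subspace", "aggregate-stats-only")
print(verify([genesis, update]))  # True; editing an earlier block breaks verification

Any later edit to an earlier block changes its digest and breaks verification downstream, which is what makes early poisoning of a "golden source" detectable.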

We can imagine data existing as blockchains in different confederates where each confederate may represent a different context for data. A confederate can be defined within the members of a department of an organization or as large as between nation states collaborating on intelligence gathering and sharing.

The main motivations for such a distributed system for implementing SDI are 1) to eliminate a single point of failure induced by a central solution (e.g., imagine an adversary taking down the site with a DDoS attack while in parallel trying to infect hosts that rely on the web-based authentication) and 2) the scalability of a distributed solution (i.e., a multitrusted analyst with accountability and auditability of their records).

Clearly, smart data have high overhead and computational costs. Although they may not be suitable for all applications, they are robust where data have high monetary or mission-critical value, such as in the financial, health care, energy, and defense industries.
Acknowledgment
We would like to thank Can Yildizli of Prodaft for providing the table in Figure 1.

References
1. R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, Microsoft malware classification challenge. 2018. [Online]. Available: https://arxiv.org/abs/1802.10135
2. B. Biggio, B. Nelson, and P. Laskov, "Poisoning attacks against support vector machines," in Proc. 29th Int. Conf. Machine Learning (ICML), 2012, pp. 1467–1474.
3. Q. Liu, P. Li, W. Zhao, W. Cai, S. Yu, and V. Leung, "A survey on security threats and defensive techniques of machine learning: A data driven view," IEEE Access, vol. 6, pp. 12,103–12,117, Feb. 2018.
4. P. McDaniel, N. Papernot, and Z. B. Celik, "Machine learning in adversarial settings," IEEE Security Privacy, vol. 14, no. 3, pp. 68–72, 2016.
5. C. Szegedy et al., "Intriguing properties of neural networks," in Proc. Int. Conf. Learning Representations, 2014.
6. N. Papernot and P. McDaniel, On the effectiveness of defensive distillation. 2016. [Online]. Available: https://arxiv.org/abs/1607.05113
7. D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Muller, "How to explain individual classification decisions," J. Mach. Learn. Res., vol. 11, pp. 1803–1831, June 2010.
8. P. W. Koh and P. Liang, Understanding black-box predictions via influence functions. 2017. [Online]. Available: https://arxiv.org/abs/1703.04730
9. R. Shwartz-Ziv and N. Tishby, Opening the black box of deep neural networks via information. 2017. [Online]. Available: https://arxiv.org/abs/1703.00810
10. A. Bulazel and B. Yener, "A survey on automated dynamic malware analysis evasion and counter-evasion: PC, mobile, and web," in Proc. 1st Reversing and Offensive-Oriented Trends Symp. (ROOTS), 2017. doi: 10.1145/3150376.3150378.
11. I. J. Goodfellow et al., Generative adversarial networks. 2014. [Online]. Available: https://arxiv.org/abs/1406.2661
12. J. Jo and Y. Bengio, Measuring the tendency of CNNs to learn surface statistical regularities. 2017. [Online]. Available: https://arxiv.org/abs/1711.11561
13. I. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples. 2014. [Online]. Available: https://arxiv.org/abs/1412.6572
14. A. Kurakin, I. Goodfellow, and S. Bengio, Adversarial machine learning at scale. 2016. [Online]. Available: https://arxiv.org/abs/1611.01236
15. K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. D. McDaniel, On the (statistical) detection of adversarial examples. 2017. [Online]. Available: https://arxiv.org/abs/1702.06280
16. J. Manuel, J. Salvio, and W. Low, "CVE-2017-11826 exploited in the wild with politically themed RTF document," Fortinet, Nov. 22, 2017. [Online]. Available: https://www.fortinet.com/blog/threat-research/cve-2017-11826-exploited-in-the-wild-with-politically-themed-rtf-document.html
17. L. F. Sikos, "OWL ontologies in cybersecurity: Conceptual modeling of cyber-knowledge," in AI in Cybersecurity, L. F. Sikos, Ed. Cham, Switzerland: Springer, 2018. doi: 10.1007/978-3-319-98842-9_1
18. Y. Ji, X. Zhang, and T. Wang, "Backdoor attacks against learning systems," in Proc. IEEE Conf. Communications and Network Security (CNS), 2017, pp. 1–9. doi: 10.1109/CNS.2017.8228656.

Bülent Yener is a professor in the Department of Computer Science and the founding director of the Rensselaer Polytechnic Institute Data Science Research Center, Troy, New York. Yener received a Ph.D. in computer science from Columbia University. He is a senior member of the ACM and a Fellow of the IEEE. Contact him at [email protected].

Tsvi Gal is a managing director at Morgan Stanley. His research interests include artificial intelligence/machine learning, cybersecurity, cloud computing, and ultralow-latency systems. Tsvi received an executive MBA in information technology from the Golden Gate University, California. He was the winner of Israel's Einstein Award for his pioneering work on online banking and payments and was a member of the Group of Eight technology council. Contact him at tsvi [email protected].
