
BIG DATA PROBLEMS IN CYBERSECURITY

Cybersecurity in the Era of Data Science: Examining New Adversarial Models
Bülent Yener | Rensselaer Polytechnic Institute
Tsvi Gal | Morgan Stanley

The ever-increasing volume, variety, and velocity of threats dictate a big data problem in cybersecurity and necessitate the deployment of AI and machine-learning (ML) algorithms. The limitations and vulnerabilities of AI/ML systems, combined with the complexity of data, introduce a new adversarial model, which is defined and discussed in this article.

Data science provides tools for synthesizing large amounts of data very quickly, improving running hypotheses (in supervised and reinforced methods) as well as identifying unforeseen patterns (so-called unknown unknowns in the unsupervised method). It enables scientists to extract knowledge, identify patterns, and build predictive models from complex data characterized by 1) very large volumes (terabytes to petabytes), 2) fast arrival/generation rate (velocity) by multiple heterogeneous and noisy (e.g., dirty, not useful, or misleading) data sources, and 3) different modalities (variety). Recently, new threats, such as ransomware, introduced another feature of complex data: their ever-increasing value.

Various AI/ML techniques have been successfully deployed in cyberdefense for access control, traffic analysis, spam detection, anomaly detection, and intrusion detection. User and entity behavioral analytic techniques analyze behavior logs and network traffic in real time to respond to an attack.

All of these systems are based on either training ML algorithms on labeled data for supervised learning, deploying unsupervised techniques to cluster unlabeled data, or using some combination of labeled and unlabeled data for semisupervised learning tasks. As modern AI and ML move from rule-based, limited data systems to massive (unstructured) data models, new techniques, such as generative adversarial networks (GANs), are moving into the mainstream of tools being used. The term AI refers to the discipline that includes knowledge representation, natural language processing (NLP), planning, ontologies, and ML. Thus, although ML is the focus of this article, it is a subarea of AI.

As we discuss in the "Trustworthiness of Data" section, solutions must borrow from other areas in AI, such as ontologies and semantics. AI/ML algorithms can be used for clustering, classifying, or predicting events of interest. For example, various classifiers, including deep neural networks (DNNs), are widely used in 1) distinguishing malware from benign programs and 2) classifying malware to one of the known families.1 However, these sophisticated algorithms have fundamental limitations, which can be intentionally abused to misclassify a malware, destroy the reputation of a company, evade an anomaly detection system, and even attack the underlying cryptographic systems. Thus, in this era of "big data," when we rely more and more on AI/ML techniques, a new adversarial model emerges with new threat models that must be understood, along with a discussion about how to counter it.
Challenges Resulting From Big Data
Training AI/ML algorithms and building models
for complex data are a challenge for several reasons,
including 1) noisy (from filtering) data streams, 2) data coming from multiple sensors (indicating a data fusion problem), and 3) data having different modalities (e.g., multiway models). Learning from multiple heterogeneous data streams requires multivariate models and computing complex distributions (e.g., prior probability distributions or density functions). In many cases, the exact computation of these distributions is not feasible (due to unbounded sums, infinite integrals, and so on); thus, AI/ML can provide only approximate solutions, which are found in some local optimum. The implications of these limitations are not well understood for the cyberdefense systems that rely on automation and AI/ML techniques. Some specific challenges due to the nature of big data are mentioned in the following sections.

High-Volume Threats Without Robust Attribution
The pervasive and ubiquitous nature of the Internet makes distributed and large-scale attacks [e.g., distributed denial-of-service (DDoS) attacks] very effective tools for an adversary.3 Although these threats are known to the cyberdefense community, there are two new factors: 1) know-how is not limited, and sophisticated attackers are easily found in the cyberworld [e.g., Internet Relay Chats (IRCs)] and 2) underground economies in the darknet provide overwhelming computing power. Because it is possible to buy malicious software, ranging from packers to zero-day malware, and rent botnets for less than US$1 per machine in underground markets, deciding the attribution for an attack is very difficult. For example, a 16-year-old who learned how to implement DDoS attacks from experts used available code from an IRC and, wielding a botnet, launched a DDoS attack against Yahoo, eBay, and Amazon, causing an estimated US$1.2 billion in damages (see https://en.wikipedia.org/wiki/MafiaBoy).

Technologies such as botnets, digital cash, and anonymous communications (e.g., Tor) can be abused by virtually any entity (person, organization, or state) that can buy malware, rent botnets, and embark on a cyberattack. For example, in Figure 1 we show how readily available it is to rent or buy tools for a cyberattack. Thus, we must consider sophisticated defense systems that use AI/ML techniques to enable the automatic detection and correlation of traffic flows as well as build hypotheses for predicting threats (potentially against an adversary that may also use AI/ML techniques to evade our systems).

Figure 1. A screenshot of available botnets in an underground market where digital cash could be spent.

High-Velocity Data Analysis Is Complicated
In the analysis of big data, velocity may enforce a limitation on how many data can be processed in real time (or near real time) and may require working with partial information to compute a function on the data. The need for a timely response (interactive applications) and the limitations of compute resources (scalability) prohibit generating exact results. One approach used in data science is called sketching, which produces approximate results that are orders of magnitude faster and have provable error bounds. However, sketches may capture only limited temporal states (sliding window model), create artifacts (graph sparsification), and may be limited (e.g., computing entropy for detecting outliers). The smart adversary model can exploit these limitations and evade defense systems, which rely on real-time processing of data.
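To make the idea of sketching concrete, the short Python sketch below maintains approximate per-flow counts with a count-min sketch; the width, depth, and hash construction are illustrative choices rather than recommendations, and a production system would also age out old counts (e.g., a sliding window).

# Minimal count-min sketch: estimate per-flow packet counts in bounded memory.
# Width, depth, and the MD5-based hashing are illustrative choices.
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _buckets(self, key):
        for row in range(self.depth):
            digest = hashlib.md5(f"{row}:{key}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def add(self, key, count=1):
        for row, col in self._buckets(key):
            self.table[row][col] += count

    def estimate(self, key):
        # Never underestimates; overestimates are bounded with high probability.
        return min(self.table[row][col] for row, col in self._buckets(key))

sketch = CountMinSketch()
for flow in ["10.0.0.5->80", "10.0.0.5->80", "192.168.1.9->443"]:
    sketch.add(flow)
print(sketch.estimate("10.0.0.5->80"))  # approximately 2

The estimate never undercounts, and its overestimate shrinks as the width grows, which is the kind of provable error bound referred to above; the tradeoff is that, as noted, such summaries retain only limited temporal state.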
The essence of the problem is the difficulty of learning patterns and making predictions from fast-moving time-series data. For certain applications, simple statistical techniques such as a moving average computed in a small window size (e.g., in milliseconds) can be a useful indicator, but in cybersecurity applications, such averaging techniques may not be as useful. The patterns of interest can be detected by a combination of multiple features in a high-dimensional space. For example, (near) real-time anomaly detection and outlier identification problems require extracting a high-dimensional feature vector from the data streams, which can be used to identify patterns by learning algorithms. Similarly, analyzing incoming http traffic streams for detecting phishing attacks may require examining the entire page and maintaining a much larger window of operation.

A particularly challenging problem in high-velocity data streams is detecting phase transitions. This is also known as the change point detection (CPD) problem and concerns estimating probability distributions over a high-dimensional space created by latent features that control the observed data. A smart adversary equipped with adversarial input can modify the shapes of the
learned distributions (e.g., flatter and overlapping distributions) to evade CPD systems. Implicit in these examples is the multiplicity of data with respect to different sources and modalities, which require multivariate treatment of data. There are well-known scalability and feasibility challenges that extend univariate learning models to multivariate ones. Addressing the CPD problem in high-velocity, multivariate data streams is a particularly difficult one.
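As a minimal, univariate illustration of CPD, the toy CUSUM detector below flags a shift in the mean of a stream; the target mean, drift, and threshold are arbitrary assumptions, and a realistic detector would have to operate on multivariate, high-dimensional features as discussed above.

# Toy one-sided CUSUM detector for a shift in the mean of a univariate stream.
# Target mean, drift, and threshold are illustrative values only.
def cusum(stream, target_mean, drift=0.5, threshold=5.0):
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + (x - target_mean - drift))
        if s > threshold:
            return t  # index at which a change point is declared
    return None

baseline = [1.0, 0.8, 1.2, 0.9, 1.1, 1.0]
shifted = [3.1, 2.9, 3.3, 3.0]  # a distribution shift an adversary may try to mask
print(cusum(baseline + shifted, target_mean=1.0))

A smart adversary who knows the drift and threshold can spread the shift over many small steps so that the cumulative statistic never crosses the threshold, which is exactly the evasion concern raised here.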
Because modern attacks take a long-term view of compromising their target, they are no longer single attacks or even a concentrated barrage attack; instead, they are continuous, sophisticated, and mostly undetected. They often aim to infiltrate an entire network, with the goal of remaining sustainable once they succeed. Advanced persistent threats (APTs) move from infiltration to expansion and extraction and are able to cover their tracks. APT attacks differ from traditional web application threats in that they are more complex, target an entire network, and, most importantly, are persistent. We will revisit this issue in the "A New Threat Model: Smart Adversary" section, where we discuss insider threats.

The continuous nature mentioned previously means that once an environment is compromised, the adversary stays put to acquire as much information as possible. These attacks can take a variety of forms, such as an SQL injection, remote file inclusion, and cross-site scripting. Their persistent nature makes these attacks especially difficult to detect and fight with tools such as defense in depth, firewalls, and antivirus (AV) programs.

A New Threat Model: Smart Adversary
We define a new adversarial model: the smart adversary, in which an adversary uses sophisticated techniques for defensive purposes against its targets. For example, ransomware attacks demonstrate how a sophisticated adversary can use cryptography for protection and as a weapon against its victims. Similarly, a smart adversary can abuse AI/ML techniques to beat intelligent cyberdefense systems that rely on AI/ML. Such attacks can target the training data, test data, and model parameters used by AI/ML algorithms depending on the underlying white-box (open) or black-box (closed) architecture.

The goals of a smart adversary may include eroding confidence and trust in the underlying AI/ML system (by way of introducing false negatives and false positives), targeting a particular class for misclassifications, and evading automated detection by obfuscation.

The techniques a smart adversary can deploy depend on the target system. Open-box systems are transparent; the internal states, data, and gradient calculation are openly available, while in black-box systems, only the input–output relationship is observed. For example, in black-box systems, an adversary may simulate the behavior of ML algorithms on surrogate training data sets and then perform a sensitivity analysis to understand the relationship between data instances and accuracy of the algorithms. This understanding can be used to design causative attacks on learning algorithms by poisoning the training data (e.g., poisoning attacks on support vector machines).2 On the
other hand, if the internal states of the ML algorithm targeted by an adversary are known (i.e., open-box systems), then exploratory attacks that exploit the algorithm become possible.

In both classes of attacks, the adversary carefully chooses specific data points to perturb depending on the algorithm used to create an adversarial input. Deploying an adversarial input against a cybersecurity system that uses AI/ML is similar to zero-day exploits and can have serious consequences.

Adversarial Input Exploits Instability in DNN Architectures
All ML algorithms are vulnerable to adversarial input (even the linear models), but DNNs are particularly interesting because of the nonconvexity and nonlinearity of the learned models.3,4 We can model a DNN by a multidimensional function F: X → Y, where X is a d-dimensional feature (input) vector and Y is an output vector (e.g., probability assignments to each "class" for input X). If a DNN has n layers, then we can decompose F as F(X) = Fn(Fn−1(… F1(X, θ1), …, θn−1), θn), where θi denotes the trained parameters at layer i.

An adversarial data point X′ can be constructed by adding a minimum perturbation δX so that X′ = X + δX and ‖δX‖ is minimized, such that F(X′) has a label Y′ that is different from Y, thus causing a misclassification. The inherent instability of a DNN with n layers under adversarial input is attributed to how the perturbation to input X propagates over the n layers. The impact of the perturbation can be bounded by the Lipschitz constant Li at each layer i, and, overall, the perturbation has an avalanche effect because it grows as5

‖Fi(X) − Fi(X + δX)‖ ≤ Li‖δX‖
⇒ ‖F(X) − F(X + δX)‖ ≤ ‖δX‖ · L1 L2 ⋯ Ln.
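The following numeric sketch (a hypothetical two-layer ReLU network with random weights, not a model from the article) illustrates this bound by comparing the observed output perturbation with the product of the layers' spectral norms, which play the role of the Li.

# Numeric illustration of the layer-wise Lipschitz bound for a perturbation dX.
# The two random linear layers are a hypothetical stand-in for a trained DNN.
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(64, 32)), rng.normal(size=(16, 64))
F = lambda x: W2 @ np.maximum(W1 @ x, 0.0)  # F = F2(F1(x)) with ReLU in between

x = rng.normal(size=32)
dx = 1e-3 * rng.normal(size=32)

L1 = np.linalg.norm(W1, 2)  # spectral norm = Lipschitz constant of layer 1
L2 = np.linalg.norm(W2, 2)  # ReLU itself is 1-Lipschitz
observed = np.linalg.norm(F(x + dx) - F(x))
bound = L1 * L2 * np.linalg.norm(dx)
print(observed <= bound, observed, bound)  # the product of the Li bounds the avalanche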
There is a tradeoff between adding more layers to a DNN architecture for reduced training time and its increased instability. Current architectural considerations for DNNs are focused on accuracy-performance tradeoffs and do not address the impact of adversarial input on the architecture choices, although such considerations may provide some level of defense.6

Explainable AI/ML Versus Adversarial Input
There is increased demand for ML algorithms to explain and justify how they arrive at a particular result. For example, the General Data Protection Regulation (GDPR) defines the law on data protection and privacy for all individuals who live in the European Union. According to Article 22 of the GDPR, customers must have "clear-cut reasons for how they were adversely impacted by a decision." This requirement asserts that the underlying AI/ML model explains what the parameters of the model are and how they operate to reach their decision. There are several ways AI can be explained when it is used in a risk or regulatory context. Scoring algorithms, e.g., inject noise and artificial data points around a real point and observe what features impact the score (such as local, interpretable model-agnostic explanations). Others include sensitivity analysis,7 influence curves,8 relevance propagation, and an information bottleneck that establishes a tradeoff between compression and prediction.9

There is an irony in the explainability of ML algorithms and adversarial learning: although understanding the internal operations of black-box learning algorithms (e.g., gradient based) is very desirable, the techniques used to decipher them can also be used to choose the best data points to abuse them. For example, sensitivity analysis of a DNN would show us how changes in input will impact the output; i.e., it quantifies the importance of each input variable i on the observed change of the function f(x) as Ri = |(∂/∂xi) f(x)|.

By computing the forward or backward derivatives (depending on the type of the DNN architecture), one may gain some insight into how the algorithm works. For instance, if the algorithm is open box, a smart adversary can deploy the same techniques to choose input data that will maximize the misclassification or discrimination against a subpopulation by poisoning the training set or eroding trust. If the algorithm is black box, a smart adversary can deploy transfer-learning techniques by developing a surrogate model that emulates the target algorithm. Thus, our efforts to understand ML better also create a more sophisticated adversary.
Decoys and Obfuscation as Adversarial Input
One application of adversarial data generation in the cybersecurity domain is malware obfuscation, i.e., how to modify the code to evade static malware detection algorithms while ensuring that malicious behavior remains unchanged. Obfuscation successfully defeats signature-based AV systems.10

In dynamic analysis environments, malware is executed in a sandbox and its behavior is examined. However, new-generation malware is smart enough to detect if it is being instrumented and avoids dropping its payload. It is particularly important for zero-day attacks to drop the payload only when there is no instrumentation (e.g., no sandboxing). As a result, defensive technologies (AV systems) try to obfuscate their existence and avoid detection so as to "fool" the malware into exhibiting its malicious behavior in a sandbox.10

It is not farfetched to imagine smart malware that can access a website and utilize an AI/ML system to
decide whether it is interacting with a native or instrumented environment. In this scenario, an adversary (malware) can train a binary classifier to distinguish "fake" environments from "native" ones. To counter such an adversary, defensive technologies must generate sandboxes that can fool the distinguisher (Dis). This scenario resembles a typical GAN11 model where the adversary is the Dis and the defensive technology is the generator (Gen) component. There is a zero-sum game played between the Dis and Gen, which establishes the limits of best-possible obfuscation.
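A heavily condensed version of that game is sketched below in PyTorch: a generator produces synthetic environment "fingerprints," and a discriminator tries to separate them from native-host fingerprints. The feature dimension, network sizes, and synthetic data are placeholder assumptions; the article does not prescribe an implementation.

# Minimal GAN-style sketch: a generator (defender) learns to emit sandbox
# "fingerprints" that a discriminator (the malware's classifier) cannot tell
# apart from native-host fingerprints. All dimensions and data are synthetic.
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, noise_dim = 8, 4
native = torch.randn(512, dim) + 1.0  # stand-in for native-host feature vectors

G = nn.Sequential(nn.Linear(noise_dim, 32), nn.ReLU(), nn.Linear(32, dim))
D = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))
loss = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)

for step in range(2000):
    fake = G(torch.randn(64, noise_dim))
    real = native[torch.randint(0, 512, (64,))]
    # Discriminator step: label native hosts as 1 and generated sandboxes as 0.
    d_loss = loss(D(real), torch.ones(64, 1)) + loss(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()
    # Generator step: make generated sandbox fingerprints look native to D.
    g_loss = loss(D(fake), torch.ones(64, 1))
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()

print(d_loss.item(), g_loss.item())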
Decoy attacks are common tools for obfuscation. For example, a zero-day attack can be wrapped in an easier-to-catch malware that serves as a decoy. If an AV system takes the bait and catches the decoy malware, the actual zero-day exploit will not be dropped or unpacked.

As mentioned previously, we saw the first case of ransomware being used as a decoy in late 2015. However, by the end of 2016, more targeted attack groups had adopted the tactic. One of the most high-profile examples was the Sandworm cyberespionage group, which created a new version of its destructive Disakil Trojan (Trojan.Disakil), disguised as ransomware. As with DDoS attacks, using ransomware as a decoy had a similar effect, sowing confusion among the victims and delaying an effective response. A recent example was a politically themed document circulating in the Philippines (see https://www.fortinet.com/blog/threat-research/hussarini—targeted-cyber-attack-in-the-philippines.html).16

Eroding Trust
A smart adversary can manipulate AI/ML algorithms to misclassify input data to a target class. An adversary may, e.g., manipulate the learning algorithm to maximize false-positive rates, which would disrupt normal operations and create a trust problem for the ML algorithms. The target class can be chosen to attack a particular employee, client, or subpopulation depending on the context.

A particularly powerful adversarial model that combats AI/ML systems is the insider threat model. Insider attackers operate behind firewalls in semitrusted environments because they may be legitimate employees, contractors, or collaborators with access to how AI/ML algorithms work, what kind of rule base they implement, what red flags exist, and so on. As a result, they aim to avoid raising a red flag by not violating rules. For example, knowing the operational details of how an anomaly detection system may work in his/her company, an insider attacker may divide the malicious behavior into minuscule and inconspicuous steps to conceal them in a stream of legitimate behaviors of benign users. Insider attacks are examples of complex attack vectors in which adversarial behavior is spread over time and composed of low-intensity steps to stay under the radar. ML algorithms that deploy past information, such as hidden Markov models and long short-term memory architectures, can be used to capture the behaviors of each user, and anomaly detection mechanisms can be fine-tuned at the expense of scalability.
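A toy version of such per-user modeling is shown below: a first-order Markov chain over event types is fit to a user's history, and new sequences are scored by their average log-likelihood. The event names, smoothing constant, and vocabulary size are illustrative stand-ins for the HMM/LSTM approaches mentioned above.

# Toy per-user behavior model: a first-order Markov chain over event types.
# A low average log-likelihood for a new sequence flags a possible anomaly.
# Event names, smoothing, and vocabulary size are illustrative placeholders.
import math
from collections import defaultdict

def fit(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return counts

def avg_log_likelihood(counts, seq, alpha=1.0, vocab=4):
    ll = 0.0
    for a, b in zip(seq, seq[1:]):
        total = sum(counts[a].values())
        ll += math.log((counts[a][b] + alpha) / (total + alpha * vocab))
    return ll / max(len(seq) - 1, 1)

history = [["login", "read", "read", "logout"]] * 20
model = fit(history)
print(avg_log_likelihood(model, ["login", "read", "read", "logout"]))      # high: normal
print(avg_log_likelihood(model, ["login", "export", "export", "logout"]))  # low: suspicious

An insider who scatters the rare "export" events thinly among long runs of ordinary activity pulls the score back toward normal, which is why finer-grained, per-user models (and their scalability cost) matter here.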
Trustworthiness of Data
As cyberthreats become increasingly pervasive and global, nation states, law enforcement agencies, and private organizations share potential vulnerabilities, e.g., zero-day exploits, new patterns of attacks, and the intelligence information to defend. However, such information sharing and collaboration are vulnerable to information warfare in which an adversary may inject malicious or deceptive information to poison the collaborative defense systems.

Manufacturing and disseminating malicious information (e.g., news, rumors, opinions, and malicious sensor data) have always been important tools in information warfare; however, they have recently become a serious threat to national security because many nations rely more and more on news streams from the Internet in the form of social networks, blogs, instant messages, emails, and so on.

Recent advances in AI, including ML and NLP, created interest in how these systems can be deployed to combat the adversarial and malicious data problem. Unfortunately, current state-of-the-art methods in these fields demonstrate a very modest level of success, which indicates that more research must be done in these areas. Furthermore, deep ML algorithms are themselves subject to adversarial data and can be manipulated to malfunction with high confidence, as mentioned previously. An early poisoning of mission-critical shared data can be very difficult to detect and attribute. A promising approach that creates and maintains distributed trust for shared data is adapting blockchain technologies to address many (not all) of the concerns about sharing mission-critical data. A blockchain can be created when data are generated, and all access to and modification of the data can be enforced by cryptographic contracts, thus providing accountability and auditability for forensics purposes. This phenomenon, termed universal error, suggests that an error made (or maliciously caused) early enough in the chain would be accepted as a "golden source" (i.e., as ultimate truth) and transferred as such to all downstream systems, causing the entire ecosystem to accept this erroneous data point as "fact." Additionally, given system interconnectivity and the self-correction modules within many institutions, a well-placed false fact can be duplicated and accepted by many or all participants in the network as truth, making it a universal error.

An important concept in cybersecurity is the attribution of malicious acts, ranging from DDoS attacks to a collection of malicious users' accounts in social media instigating an information warfare campaign. Furthermore, malicious accounts can be used for cybersecurity attacks, including impersonation, hacking user accounts, and spreading malware. Attribution in information warfare requires finding statistical patterns that indicate correlated and coordinated similar activities. AI/ML techniques that address attribution problems encounter similar problems faced when combating spam messages because of a lack of semantic interpretation or some level of contextual understanding. It is relatively easy to deceive an AI/ML system and cause both false positives and false negatives by introducing misinformation and noise; therefore, a smart adversary currently has the upper hand in deception.

As we move toward a data-centric world in which data are increasingly treated like commodities, it is no longer enough to think of data as spreadsheets or tables because applications demand answers to questions such as what is the origin of these data; what are the semantics and how do they change over time; how and by whom are data modified; and who owns the data. These questions are related to the "trustworthiness" (integrity, confidentiality, authenticity, and auditability) of data, and some of them can be answered by data provenance and lineage techniques. However, data provenance provides a contextual and historical record of passive data but cannot address more elaborate or algorithmic questions such as how do we verify and resolve semantic inconsistencies; what are the descriptive and predictive features of data; and what is the value of data with respect to statistical and machine-learning tasks. These are particularly important questions for ML algorithms that extract the features directly from data without human intervention or domain expertise.

It has been suggested that DNNs do not learn the semantics and high-level abstractions in the data set but learn only simple surface statistical regularities: thus, ML models generalize well without ever having to explicitly learn abstract concepts12 (perhaps this is one of the reasons why CNNs are vulnerable to adversarial examples). Finally, questions pertaining to the trustworthiness of data must be addressed.

Perhaps what is needed is a new, algorithmic and value-based perspective for data-as-a-commodity that is designed to be active and smart while keeping its provenance and lineage intact. Smart data rely on 1) data provenance and lineage information and 2) ontologies that define their context and associated semantics. OWL-based ontologies are currently being built for cybersecurity domains.17 However, smart data are aware of what algorithms, statistical tests, and abstractions can be executed while ensuring integrity, access control, confidentiality, and auditability. For example, consider a protected health information record (PHIR) that includes diagnosis, medical test results, and demographic and insurance information. A smart PHIR would allow certain fields to be accessed only by medical personnel, others only by insurance companies, and yet other fields only by data mining companies. Depending on the value of each field, a smart PHIR may deploy different fees and charging mechanisms where the integrity of each data field is enforced by digital signatures and cryptographic checksums. PHIRs can be generalized to other database applications (e.g., human resources, defense, or advanced manufacturing) where different data fields have different security and semantic properties. Smart data may not trust the database server and require an encrypted database in which operations only on the ciphertext are allowed. A database that keeps data unencrypted is vulnerable to multiple attacks.
Many security breaches of government databases have been reported in recent years. If a database is not encrypted, then data may be leaked (i.e., a breach of confidentiality) or cross-correlated (i.e., a breach of privacy). If the data are kept in the cloud, a smart adversary can potentially mishandle calculations, delete data, refuse to return results, collude with other parties, and so on. An adversary can ruin the reputation of a cloud business with such an attack as well. However, protecting the confidentiality, integrity, and privacy of data creates a tradeoff with what can be learned and mined from it.

A compromise would be to deploy capability-based access mechanisms (to specify which fields of a PHIR can be accessed by whom and what operations are allowed on these fields) along with more-advanced encryption techniques that can operate on the ciphertext. These techniques include 1) homomorphic encryption (HE), which allows some aggregate statistics to be computed on the encrypted data only (e.g., average blood pressure over the last five years); 2) functional encryption (FE), which controls the fields that can be accessed by authorized parties [in FE, given a ciphertext C(m) for a message m and a secret key for a function f, a user can learn only the value f(m)]; or 3) attribute-based encryption, in which the secret key provides an access formula that operates over a set of n attributes that must evaluate to true for decryption to yield the message.

Although the cryptographic techniques mentioned in this section are not widely used in practice due to their high complexity, their lightweight implementations are attractive, even at the expense of learning. For example, a "leveled homomorphic" scheme can evaluate only polynomial functions of the input data limited to homomorphic
addition and multiplication of the data. In that case, learning is also restricted to algorithms that can be expressed as polynomials with bounded degrees (see https://eprint.iacr.org/2012/323.pdf).
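As a small illustration of computing an aggregate statistic directly on ciphertexts, the fragment below uses the open source python-paillier package (imported as phe), assuming it is installed; Paillier supports only homomorphic addition and multiplication by plaintext scalars, so it stands in for the more general leveled schemes discussed here.

# Additively homomorphic aggregation on ciphertexts, assuming the
# python-paillier package ("phe") is available. Paillier permits adding
# encrypted values (and scaling by plaintext constants), which is enough to
# compute an aggregate such as an average without exposing individual records.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair(n_length=1024)

blood_pressure = [118, 131, 125, 140]                 # plaintext stays with the client
ciphertexts = [public_key.encrypt(v) for v in blood_pressure]

encrypted_sum = sum(ciphertexts[1:], ciphertexts[0])  # the server adds ciphertexts blindly
average = private_key.decrypt(encrypted_sum) / len(blood_pressure)
print(average)  # 128.5, computed without the server ever seeing a single reading

The party that adds the ciphertexts never sees an individual value; only the key holder can decrypt the aggregate.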
We note that ML and cryptography have a duality because cryptography aims to prevent access and learning from the data, whereas ML attempts to extract knowledge from data.18 Machine-learning algorithms are used to not only attack cryptosystems but also learn from (homomorphic) encrypted data (see https://eprint.iacr.org/2012/323.pdf) and attack them. From an adversarial perspective, in case of HE, the noise terms grow, and decryption is possible only as long as these noise terms do not exceed a certain bound. A smart adversary can exploit this inherent weakness to attack cryptographic databases using adversarial inputs.

In addition to raw form, smart data support different levels of abstractions of data to be presented to applications and users. Row data, e.g., may be projected onto a subspace to eliminate noise and reduce dimensionality, and they may be presented to some applications in low rank. Thus, smart data and the algorithms using them can be coupled together to provide different value propositions to users based on their needs and demands from the data.

However, as discussed in this article, data and algorithms must be examined from an adversarial perspective. During the 1980s, the lack of consideration for security properties in the design of Internet protocols demonstrated that focusing only on performance (with respect to accuracy and efficiency) created ongoing vulnerabilities and patches. Today, we experience a similar situation occurring in the AI/ML landscape, where security properties of algorithms are not considered as a part of design and implementation. Thus, we first must train and test our algorithms by putting on black hats.

Adversarial training suggests adding malicious input to the training set of a DNN as one of the defense mechanisms against a smart adversary. In particular, training with an adversarial objective function based on the fast-gradient sign method is shown to be an effective regularizer.13 Other defense techniques against adversarial input include defensive distillation.6 It is shown that if the training set contains some adversarial input, the robustness of ML increases.14 This demands that smart data must contain some information and statistical tests15 to label their vulnerability to exploits under an adversarial model. Finally, we may need to use more than one algorithm to solve the same cybersecurity problem. ML algorithms should use different objective functions and feature vectors to minimize the impact of a smart adversary.
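A compressed example of one such adversarial training step is given below in PyTorch; the toy model, synthetic features and labels, the value of epsilon, and the equal weighting of clean and adversarial losses are illustrative assumptions, while the perturbation rule itself (stepping along the sign of the input gradient) follows the cited fast-gradient sign method.13

# Sketch of one FGSM-based adversarial training step in PyTorch.
# The toy model, synthetic data, epsilon, and loss weighting are illustrative.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(20, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(128, 20)         # stand-in feature vectors (e.g., extracted from binaries)
y = torch.randint(0, 2, (128,))  # stand-in labels: benign vs. malware
epsilon = 0.1

# 1) Craft FGSM perturbations: move each input along the sign of its gradient.
x_adv = x.clone().requires_grad_(True)
loss_fn(model(x_adv), y).backward()
x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()

# 2) Train on a mix of clean and adversarial examples (the adversarial objective).
opt.zero_grad()
loss = 0.5 * loss_fn(model(x), y) + 0.5 * loss_fn(model(x_adv), y)
loss.backward()
opt.step()
print(loss.item())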
Adversarial testing is different from testing correctness of algorithms because it must consider malicious/adversarial input. There is a need to include a smart adversary during the testing stage of software production because software engineers focus only on testing functionality and end-user requirement compliance. There are programming models that include the security aspect during the development stage, but they do not include a smart adversary. In cybersecurity research, we first define an adversarial model to precisely describe the capability of a hypothetical adversary and show that proposed security mechanisms (e.g., cryptographic algorithms) can protect under that model. In AI/ML, we require a similar approach and must state the vulnerability of algorithms under different adversarial models (e.g., black box only versus open systems) to quantify potential bias and abuse. Thus, the need for adversarial training and testing of ML algorithms validates the motivation for coupling smart data with the AI/ML algorithms that make use of these data.

Finally, how do we go about implementing a smart data infrastructure (SDI)? Blockchain technologies and cryptographic contracts can be building blocks for implementing the smart data concept. Blockchains for smart data can have different granularity; they can be as small as an individual's financial records or as large as a database containing user information for a social network. As data are created, a new blockchain is initiated and the data, their provenance information, and semantics are digitally signed. Each block contains semantic and provenance (initial or modification) information about "usage of data," e.g., what types of algorithms can use these data, what functions can be performed, what is the distribution, what populations the data represent, and so on. Cryptographic contracts can be used to enforce capability-based security on smart data to control "who" can do "what" and "how." Each modification can be recorded into a new block along with algorithms and access privileges. Thus, we see cryptographic contracts as the main tool for coupling smart data with their algorithms while ensuring integrity, confidentiality, privacy, and auditability.
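A minimal sketch of such a chained record is given below; the block fields, the single usage rule, and the omission of signatures and consensus are simplifying assumptions, so the example shows only the tamper-evidence idea rather than a full cryptographic contract.

# Minimal hash-chained provenance log for "smart data": each block records who
# modified the data, what they did, and the usage rule in force, and blocks are
# chained by SHA-256 digests. Signatures, consensus, and contracts are omitted.
import hashlib
import json
import time

def make_block(prev_hash, actor, action, usage_rule):
    block = {
        "prev_hash": prev_hash,
        "actor": actor,
        "action": action,
        "usage_rule": usage_rule,  # e.g., which algorithms may consume the data
        "timestamp": time.time(),
    }
    block["hash"] = hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

def verify(chain):
    for prev, cur in zip(chain, chain[1:]):
        body = {k: v for k, v in cur.items() if k != "hash"}
        digest = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
        if cur["prev_hash"] != prev["hash"] or cur["hash"] != digest:
            return False
    return True

genesis = make_block("0" * 64, "data-owner", "create", "aggregate-stats-only")
update = make_block(genesis["hash"], "analyst-17", "project-to-subspace", "aggregate-stats-only")
print(verify([genesis, update]))  # True; editing an earlier block breaks verification

Any later edit to an earlier block changes its digest and breaks verification downstream, which is what makes early poisoning of a "golden source" detectable.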

We can imagine data existing as blockchains in different confederates where each confederate may represent a different context for data. A confederate can be defined within the members of a department of an organization or as large as between nation states collaborating on intelligence gathering and sharing.

The main motivations for such a distributed system for implementing SDI are 1) to eliminate a single point of failure induced by a central solution (e.g., imagine an adversary taking down the site with a DDoS attack while in parallel trying to infect hosts that rely on the web-based authentication) and 2) the scalability of a distributed solution (i.e., a multitrusted analyst with accountability and auditability of their records).

Clearly, smart data have high overhead and computational costs. Although they may not be suitable for all applications, they are robust where data have high monetary or mission-critical value, such as in the financial, health care, energy, and defense industries.
Acknowledgment
We would like to thank Can Yildizli of Prodaft for providing the table in Figure 1.

References
1. R. Ronen, M. Radu, C. Feuerstein, E. Yom-Tov, and M. Ahmadi, Microsoft malware classification challenge. 2018. [Online]. Available: https://arxiv.org/abs/1802.10135
2. B. Biggio, B. Nelson, and P. Laskov, "Poisoning attacks against support vector machines," in Proc. 29th Int. Conf. Machine Learning (ICML), 2012, pp. 1467–1474.
3. Q. Liu, P. Li, W. Zhao, W. Cai, S. Yu, and V. Leung, "A survey on security threats and defensive techniques of machine learning: A data driven view," IEEE Access, vol. 6, pp. 12,103–12,117, Feb. 2018.
4. P. McDaniel, N. Papernot, and Z. B. Celik, "Machine learning in adversarial settings," IEEE Security Privacy, vol. 14, no. 3, pp. 68–72, 2016.
5. C. Szegedy et al., "Intriguing properties of neural networks," in Proc. Int. Conf. Learning Representations, 2014.
6. N. Papernot and P. McDaniel, On the effectiveness of defensive distillation. 2016. [Online]. Available: https://arxiv.org/abs/1607.05113
7. D. Baehrens, T. Schroeter, S. Harmeling, M. Kawanabe, K. Hansen, and K.-R. Muller, "How to explain individual classification decisions," J. Mach. Learn. Res., vol. 11, pp. 1803–1831, June 2010.
8. P. W. Koh and P. Liang, Understanding black-box predictions via influence functions. 2017. [Online]. Available: https://arxiv.org/abs/1703.04730
9. R. Shwartz-Ziv and N. Tishby, Opening the black box of deep neural networks via information. 2017. [Online]. Available: https://arxiv.org/abs/1703.00810
10. A. Bulazel and B. Yener, "A survey on automated dynamic malware analysis evasion and counter-evasion: PC, mobile, and web," in Proc. 1st Reversing and Offensive-Oriented Trends Symp. (ROOTS), 2017. doi: 10.1145/3150376.3150378.
11. I. J. Goodfellow et al., Generative adversarial networks. 2014. [Online]. Available: https://arxiv.org/abs/1406.2661
12. J. Jo and Y. Bengio, Measuring the tendency of CNNs to learn surface statistical regularities. 2017. [Online]. Available: https://arxiv.org/abs/1711.11561
13. I. Goodfellow, J. Shlens, and C. Szegedy, Explaining and harnessing adversarial examples. 2014. [Online]. Available: https://arxiv.org/abs/1412.6572
14. A. Kurakin, I. Goodfellow, and S. Bengio, Adversarial machine learning at scale. 2016. [Online]. Available: https://arxiv.org/abs/1611.01236
15. K. Grosse, P. Manoharan, N. Papernot, M. Backes, and P. D. McDaniel, On the (statistical) detection of adversarial examples. 2017. [Online]. Available: https://arxiv.org/abs/1702.06280
16. J. Manuel, J. Salvio, and W. Low, "CVE-2017-11826 exploited in the wild with politically themed RTF document," Fortinet, Nov. 22, 2017. [Online]. Available: https://www.fortinet.com/blog/threat-research/cve-2017-11826-exploited-in-the-wild-with-politically-themed-rtf-document.html
17. L. F. Sikos, "OWL ontologies in cybersecurity: Conceptual modeling of cyber-knowledge," in AI in Cybersecurity, L. F. Sikos, Ed. Cham, Switzerland: Springer, 2018. doi: 10.1007/978-3-319-98842-9_1
18. Y. Ji, X. Zhang, and T. Wang, "Backdoor attacks against learning systems," in Proc. IEEE Conf. Communications and Network Security (CNS), 2017, pp. 1–9. doi: 10.1109/CNS.2017.8228656.

Bülent Yener is a professor in the Department of Computer Science and the founding director of the Rensselaer Polytechnic Institute Data Science Research Center, Troy, New York. Yener received a Ph.D. in computer science from Columbia University. He is a senior member of the ACM and a Fellow of the IEEE. Contact him at [email protected].

Tsvi Gal is a managing director at Morgan Stanley. His research interests include artificial intelligence/machine learning, cybersecurity, cloud computing, and ultralow-latency systems. Tsvi received an executive MBA in information technology from the Golden Gate University, California. He was the winner of Israel's Einstein Award for his pioneering work on online banking and payments and was a member of the Group of Eight technology council. Contact him at tsvi [email protected].
