Machine Learning For Synthetic Data Generation: A Review
8, AUGUST 2021 1
Abstract—Machine learning heavily relies on data, but real-world applications often encounter various data-related issues. These include data of poor quality, insufficient data points leading
the development and application of machine learning technology [4]. As the field continues to advance, addressing these obstacles will be essential for unlocking the full potential of
arXiv:2302.04062v9 [cs.LG] 30 Jun 2024
paper’s main contributions as follows:
• We present pertinent ideas and background information on synthetic data, serving as a guide for researchers interested in this domain.
• We explore different real-world application domains and emphasize the range of opportunities that GANs and synthetic data generation can provide in bridging gaps (Section II).
• We examine a diverse array of deep neural network architectures and deep generative models dedicated to generating high-quality synthetic data, present advanced generative models, and outline potential avenues for future research (Section III and IV).
• We address privacy and fairness concerns, as sensitive information can be inferred from synthesized data, and biases embedded in real-world data can be inherited. We review current technological advancements and their limitations in safeguarding data privacy and ensuring the fairness of synthesized data (Section V and VI).
• We outline several general evaluation strategies to assess the quality of synthetic data (Section VII).
• We identify challenges faced in generating synthetic data and during the deployment process, highlighting potential future work that could further enhance functionality (Section VIII).
II. APPLICATION
Synthetic data offers a multitude of compelling advantages, making it a highly appealing option for a wide range of applications. By streamlining the processes of training, testing, and deploying AI solutions, synthetic data facilitates more efficient and effective development. Furthermore, this cutting-edge technology reduces the risk of exposing sensitive information, thereby ensuring customer security and privacy [4]. As researchers transition synthetic data from the lab to practical implementations, its real-world applications continue to broaden. This section explores several notable domains where synthetic data generation substantially impacts addressing real-world challenges.
A. Vision
Supervised learning relies heavily on the availability of labeled data [51]. However, in many applications, particularly in computer vision, manual labeling is often necessary [52], [53]. Tasks such as segmentation, depth estimation, and optical flow estimation can be exceedingly challenging to label manually. Synthetic data has emerged as a transformative solution in this context, significantly improving the labeling process [54]. Sankaranarayanan et al. introduced a generative adversarial network (GAN) that narrows the gap between embeddings in the learned feature space, facilitating Visual Domain Adaptation [55]. This approach enables semantic segmentation across different domains. The GAN uses a generator to project features onto the image space, which the discriminator subsequently operates on. Adversarial losses can be derived from the discriminator’s output [56]. Notably, applying adversarial losses to the projected image space has been shown to yield significantly better performance compared to applying them directly to the feature space [55].
In a recent study, a Microsoft research team demonstrated the effectiveness of synthetic data in face-related tasks by combining a parametric 3D face model with an extensive library of hand-crafted assets [57]. This approach rendered training images with remarkable realism and diversity. The researchers trained machine learning systems for tasks such as landmark localization and face parsing using synthetic data, showing that it can achieve comparable accuracy to real data. Furthermore, synthetic data alone proved sufficient for detecting faces in unconstrained settings [57].
B. Voice
The field of synthetic voice is at the forefront of technological advancement, and its evolution is happening at a breakneck pace. With the advent of machine learning and deep learning, creating synthetic voices for various applications such as video production, digital assistants, and video games [58] has become easier and more accurate. This field is an intersection of diverse disciplines, including acoustics, linguistics, and signal processing. Researchers in this area continuously strive to improve synthetic voices’ accuracy and naturalness. As technology advances, we can expect to see synthetic voices become even more prevalent in our daily lives, assisting us in various ways and enriching our experiences in many fields [59].
The earlier study includes spectral modeling for statistical parametric speech synthesis, in which low-level, untransformed spectral envelope parameters are used for voice
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 3
TABLE I
SUMMARIZATION OF REPRESENTATIVE WORKS IN SYNTHETIC DATA GENERATION.
synthesis. The low-level spectral envelopes are represented by graphical models incorporating multiple hidden variables, such as restricted Boltzmann machines and deep belief networks (DBNs) [60]. The proposed conventional hidden Markov model (HMM)-based speech synthesis system can be significantly improved in terms of naturalness and over-smoothing [61].
Synthetic data can also be applied to Text-to-Speech (TTS) to achieve near-human naturalness [62], [63]. As an alternative to sparse or limited data, synthetic speech (SynthASR) was developed for automatic speech recognition. The combination of weighted multi-style training, data augmentation, encoder freezing, and parameter regularization is also employed to address catastrophic forgetting. Using this novel model, the researchers were able to apply state-of-the-art techniques to train a wide range of end-to-end (E2E) automatic speech recognition (ASR) models while reducing the need for production data and the costs associated with it [62].
C. Natural Language Processing (NLP)
The increasing interest in synthetic data has spurred the development of a wide array of deep generative models in the field of natural language processing (NLP) [51]. In recent years, a multitude of methods and models have illustrated the capabilities of machine learning in categorizing, routing, filtering, and searching for relevant information across various domains [64].
Despite these advancements, challenges remain. For example, the meaning of words and phrases can change depending on their context, and homonyms with distinct definitions can pose additional difficulties [65]. To tackle these challenges, the BLEURT model was proposed, which models human judgments using a limited number of potentially biased training examples based on BERT. The researchers employed millions of synthetic examples to develop an innovative pre-training scheme, bolstering the model’s ability to generalize [66]. Experimental results indicate that BLEURT surpasses its counterparts on both the WebNLG Competition dataset and the
privacy and security protocols [104]. In the past, institutions possessing extensive data repositories could potentially assist decision-makers in resolving a broad spectrum of issues. However, accessing such data, even for internal purposes, was hindered by confidentiality concerns. Presently, companies are harnessing synthetic data to refresh and model original data, generating continuous insights that contribute to enhancing the organization’s performance [4].
F. Education
Synthetic data is gaining increasing attention in the field of education due to its vast potential for research and teaching. Synthetic data refers to computer-generated information that mimics the properties of real-world data without disclosing any personally identifiable information [105]. This approach proves instrumental for educational settings, where ethical constraints often limit the use of real-world student data. Therefore, synthetic data offers a robust solution for privacy-concerned data sharing and analysis, enabling the creation of accurate models and strategies to improve the teaching-learning process.
A detailed example of synthetic data usage in education is the simulation of student performance data to aid in designing teaching strategies. Suppose an educational researcher wants to investigate the impact of teaching styles on student performance across different backgrounds and learning abilities. However, obtaining real student data for such studies can be ethically complex and potentially intrusive. In such a situation, synthetic data can be generated that mirrors the demographic distributions, learning patterns, and likely performance of a typical student population. This data can then be used to model the effects of various teaching strategies without compromising student privacy [106].
Furthermore, synthetic data can be a powerful tool in teacher training programs. For example, teacher candidates can use synthetic student data to practice data-driven instructional strategies, including differentiated instruction and personalized learning plans. They can analyze this synthetic data, identify patterns, determine student needs, and adjust their instructional plans accordingly. By using synthetic data, teacher candidates gain practical experience in analyzing student data and adapting their teaching without infringing on the privacy of actual students [107]. Thus, synthetic data serves as a valuable bridge between theory and practice in education, driving innovation while safeguarding privacy.
G. Location and Trajectory Generation
Location and trajectory are a particular form of data that can closely reflect users’ daily lives, habits, home addresses, workplaces, etc. To protect location privacy, synthetic location generation is introduced as opposed to location perturbation [108]. The main challenge of generating synthetic location and trajectory data is to resemble genuine user-produced data while offering practical privacy protection simultaneously. One approach to generating the location and trajectory data is to inject a synthetic point-based site within a user’s trajectory [109], [110].
Synthetic trajectory generation is frequently combined with privacy-enhancing techniques to further prevent sensitive inference from the synthesized data. For example, Chen et al. [111] introduce an n-gram-based method that predicts the next position from previous positions for trajectory publishing. They use a prefix tree to describe the n-gram model while combining it with differential privacy [112]. [113] extends the n-gram model with local differential privacy, and [114] further replaces the n-gram model with key movement mobility for differentially private trajectory generation. By comparison, [115] proposes a synthetic trajectory strategy based on the discretization of raw trajectories using hierarchical reference systems to capture individual movements at differing speeds. Their method adaptively selects a small set of reference systems and constructs prefix tree counts with differential privacy. Applying direction-weighted sampling, the decrease in tree nodes reduces the amount of added noise and improves the utility of the synthetic data. [116] constructs a differentially private prefix tree and calibrates original trajectories against a selection of anchor points. By extracting multiple differentially private distributions with redundant information [117], [118], the authors generate a new trajectory with samples from these distributions. By comparison, [119] estimates various distributions of an attribute set to determine trajectories, and [120] considers the interactions between different attributes by grouping strongly correlated attributes into non-disjoint sets and constructing a corresponding distribution for each set.
In addition to differential privacy, Bindschaedler and Shokri [121] enforce plausible deniability to generate privacy-preserving synthetic traces. Their method first introduces trace similarity and intersection functions that map a fake trace to a real hint under similarity and intersection constraints. Then, it generates one fake trace by clustering the locations and replacing the trajectory locations with those from the same group. If the fake trace satisfies plausible deniability, i.e., there exist k other real traces that can map to the fake trace, then it preserves the privacy of the seed trace. While existing studies mainly use the Markov chain model, [122] proposes PrivTrace, which controls the space and time overhead by the first-order Markov chain model and achieves good accuracy for next-step prediction by the second-order Markov chain model. [123] considers a location synthesizer that generates location traces, including co-locations of friends, while offering node-level differential privacy for the friendship graph and user-level differential privacy for the co-location count matrix.
H. AI-Generated Content (AIGC)
AI-Generated Content (AIGC) stands at the forefront of the technology and content creation industry, changing the dynamics of content production. A typical example of AIGC is OpenAI’s ChatGPT, an AI-driven platform generating human-like text in response to prompts or questions. It leverages a vast corpus of internet text to generate detailed responses, often indistinguishable from those a human writer would produce. This capacity extends beyond simple question-answer pairs to crafting whole articles, stories, or technical explanations on a wide range of topics, thus creating a novel way of producing blog posts, articles, social media content, etc. [124], [125].
Google’s Project Bard focuses more on the creative aspects of text generation. It is designed to generate interactive fiction and assist in storytelling. Users can engage in an interactive dialogue with the model, directing the course of a narrative by providing prompts that the AI responds to, thus co-creating a story. This opens up fascinating possibilities for interactive entertainment and digital storytelling [126].
An innovative application of AIGC is in the field of news reporting. News agencies increasingly use AI systems, such as the GPT series, to generate news content. For instance, the Associated Press uses AI to generate news articles about corporate earnings automatically. The AI takes structured data about company earnings and transforms it into a brief, coherent, and accurate news report. This automation allows the agency to cover many more companies than would be possible with human journalists alone [127].
Additionally, AIGC has found its place in the creative domain, with AI systems being used to generate book descriptions, plot outlines, and even full chapters of novels. For instance, a novelist could use ChatGPT to generate a synopsis for their upcoming book based on a few keywords or prompts related to the story. Similarly, marketing teams utilize AI to create compelling product descriptions for online marketplaces [128]. This increases efficiency and provides a level of uniformity and scalability that would be challenging to achieve with human writers alone. Looking forward, AIGC is profoundly impacting the landscape of content creation and will continue to shape it in the future [126].
I. Finance
Synthetic data generation offers significant benefits for the finance industry [102], as detailed below. First, financial data is highly sensitive and subject to stringent privacy regulations [129]. Synthetic data mimics real data without exposing actual customer information, enabling institutions to comply with privacy laws while still utilizing detailed datasets for analysis and development. Second, synthetic data can be used to test and validate financial algorithms and models under various conditions. For example, trading algorithms can be tested using synthetic market data to evaluate their performance under different market scenarios, including rare or extreme events that may not be present in historical data [130]. Third, developing and testing financial algorithms requires large volumes of high-quality data. Synthetic data provides an endless supply of training data, enabling thorough backtesting of trading strategies and machine learning models without the risk of overfitting historical data [131].
Synthetic data generation also transforms the financial services industry by enabling more accurate risk assessments and fraud detection [102]. Synthetic data generation can identify anomalies and potential risks by simulating financial transactions and market behaviors, allowing financial institutions to implement more effective fraud prevention measures and develop more resilient financial strategies. Furthermore, synthetic data generation can support compliance with regulatory requirements by providing detailed, real-time reporting and analysis of financial activities [47]. In the context of human resources, synthetic data generation can model workforce dynamics, including employee performance, engagement, and turnover. By analyzing these models, businesses can develop strategies to improve employee satisfaction, enhance productivity, and reduce turnover rates. For example, synthetic data generation can simulate the impact of various HR policies on workforce morale and performance, helping HR departments to implement the most effective practices.
J. Other Applications
The techniques for synthetic data generation described in this paper have far-reaching implications beyond the specific domains covered. Here are some notable applications:
• Retail and Marketing: In retail, synthetic data can model customer interactions, purchasing behaviors, and inventory management [132]. This aids in developing personalized marketing strategies, optimizing supply chains, and improving customer service without infringing on individual privacy.
• Environmental Studies: Synthetic data can simulate environmental conditions, weather patterns, and ecological interactions [133]. This is particularly useful for studying climate change, biodiversity, and conservation efforts, allowing researchers to test hypotheses and model practical scenarios without the constraints of limited real-world data.
• Urban Planning and Development: In urban planning, synthetic data can be used to simulate population growth, traffic flows, and infrastructure development [134]. This helps city planners and developers make informed decisions about resource allocation, transportation systems, and sustainable development initiatives.
• Software Development and Testing: In software development, synthetic code generation can simulate various coding scenarios, bug patterns, and software behaviors [135]. This is particularly useful for testing and debugging, as it allows developers to identify and fix potential issues without the constraints of existing codebases. Synthetic code can also aid in developing personalized coding assistants, optimizing software performance, and improving the reliability of code releases [136]. Additionally, by generating diverse and extensive code samples, developers can enhance machine learning models for code completion and error detection, ultimately leading to more efficient and robust software development processes.
III. DEEP NEURAL NETWORK
It is no secret that deep neural networks have become increasingly prominent in the field of computer vision and in other areas. Nevertheless, they require large amounts of annotated data for supervised training, which limits their effectiveness [137]. In this section, we review and compare various commonly-used deep neural network architectures as background knowledge, including multilayer perceptron (MLP) in Section III-A, convolutional neural network (CNN) in Section III-B, recurrent neural network (RNN) in Section III-C, graph neural network (GNN) in Section III-D and transformer in Section III-E.
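As a concrete reference point for the architectures reviewed in this section, the fully connected layer rule of Section III-A, $h_{i+1} = \sigma(W^\top h_i + b)$, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`dense_layer`, `mlp_forward`), the list-of-rows layout of `W`, and the toy weights are all hypothetical choices made for readability.

```python
import math


def sigmoid(v):
    """Logistic activation, one of the nonlinearities named in the text."""
    return 1.0 / (1.0 + math.exp(-v))


def relu(v):
    """Rectified linear unit activation."""
    return max(0.0, v)


def dense_layer(h, W, b, activation=sigmoid):
    """Compute activation(W^T h + b) for one fully connected layer.

    h: list of d_in inputs.
    W: weight matrix stored as d_in rows of d_out entries (assumed layout).
    b: list of d_out biases.
    """
    return [
        activation(sum(W[k][j] * h[k] for k in range(len(h))) + b[j])
        for j in range(len(b))
    ]


def mlp_forward(x, layers, activation=sigmoid):
    """Apply the layer rule repeatedly; the output of one layer feeds the next."""
    h = x
    for W, b in layers:
        h = dense_layer(h, W, b, activation)
    return h


# Toy two-layer network with hand-picked (hypothetical) weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
toy_layers = [(identity, [0.0, 0.0]), ([[1.0], [1.0]], [0.5])]
toy_output = mlp_forward([1.0, -2.0], toy_layers, activation=relu)
```

With ReLU as the activation, the first layer maps `[1.0, -2.0]` to `[1.0, 0.0]`, and the second layer sums the two entries and adds the bias. Swapping the `activation` argument is the only change needed to try sigmoid or tanh instead.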
A. Multilayer Perceptron (MLP)
The multilayer perceptron (MLP) is a classic feedforward artificial neural network in which every layer is fully connected, so it models all possible interactions between the features. It is also among the most widely used neural networks. In the $i$-th layer, the propagation can be written as
$$h_{i+1} = \sigma(W^\top h_i + b), \tag{1}$$
where $h_i$ is the input and $h_{i+1}$ is the output (which is also the input of the $(i+1)$-th layer), $W$ is the weight matrix, and $b$ is the bias vector; both $W$ and $b$ are parameters of the MLP. $\sigma(\cdot)$ denotes the activation function. Popular activation functions include the sigmoid, ReLU, tanh, etc. The goal of the activation function is to provide a nonlinear transformation. MLP is the basis of many neural network architectures. Please refer to [138] for more details about MLP.
B. Convolutional Neural Network (CNN)
The convolutional neural network (CNN) was proposed to learn a better representation of images [138]. The core idea of the convolutional neural network is to design a two-dimensional convolutional layer that slides over the image horizontally and vertically to model small patches. Convolutional neural networks with one-dimensional convolutional layers can also model sequence data. Please refer to [138] for more details about CNN.
C. Recurrent Neural Network (RNN)
The sequence is one of the most popular data formats in real-world applications, such as natural language and speech signals. The recurrent neural network (RNN) was designed to represent and generate sequence data [139]. Suppose we have a sequence of length $T$, $X = [x_1, x_2, \ldots, x_T]$, where $x_t$ denotes the $t$-th token. The recurrent neural network uses a recursive (or recurrent) mechanism to reuse the neural network module. Specifically, at the $t$-th step, we have
$$o_t, h_t = \mathrm{RNN}(x_t, h_{t-1}), \tag{2}$$
where $h_{t-1}$ denotes the hidden state at the $(t-1)$-th step and is used to represent all the historical memory from the previous $t-1$ tokens $(x_1, x_2, \ldots, x_{t-1})$; $o_t$ denotes the output of the RNN at the $t$-th step. The RNN module is recursively reused for each token. However, the vanilla RNN structure suffers from the vanishing gradient problem, especially for long sequences. The state-of-the-art RNN architectures long short-term memory (LSTM) [140] and gated recurrent unit (GRU) [141] introduce specialized gates that retain memory to alleviate the issue.
D. Graph Neural Network (GNN)
Graph-structured data are common in downstream applications, such as social networks, brain networks, chemical molecules, and knowledge graphs. The graph neural network was proposed to model graph-structured data and learn the topological structure from the graph [142]–[144]. Specifically, graph neural networks build the interaction between the connected nodes and edges to model the topological structure of the graph [145], [146]. The feedforward rule of a graph neural network can be formulated as
$$h_i^{(l+1)} = \mathrm{Aggregate}_{j \in \mathcal{N}(i)}\big(h_j^{(l)}, m_j, m_{ji}\big), \quad l = 1, \cdots, L, \tag{3}$$
where $h_i^{(l)}$ denotes the embedding of the $i$-th node at the $l$-th layer, $m_j$ denotes the node feature of node $j$, $m_{ji}$ denotes the edge feature of the edge that connects nodes $j$ and $i$, and $\mathcal{N}(i)$ is the set of nodes that connect with $i$. Within each layer, we update the current representation by aggregating information from (1) the representation from the previous layer, (2) the node feature, and (3) the edge feature. For example, in chemical compounds, each node corresponds to an atom, and the node feature is the category of the atom, e.g., Carbon atom, Nitrogen atom, Oxygen atom; each edge is a chemical bond, and the edge feature is the type of the bond, including single bond, double bond, triple bond, and aromatic bond.
E. Transformer
The Transformer architecture, introduced by Vaswani et al. in the groundbreaking paper “Attention is All You Need” in 2017, revolutionized the field of natural language processing and machine learning. Unlike traditional sequential models like RNNs (Recurrent Neural Networks) and LSTMs (Long Short-Term Memory), the Transformer model leverages a unique mechanism called self-attention, which allows it to capture long-range dependencies and relationships within the input data more effectively. This architecture consists of a stack of identical layers, each containing a multi-head self-attention mechanism followed by position-wise fully connected feed-forward networks. By eschewing recurrent or convolutional layers, the Transformer model is highly parallelizable and computationally efficient, leading to faster training times and improved performance on various NLP tasks.
The Generative Pre-trained Transformer (GPT) is a cutting-edge deep learning model that has revolutionized natural language processing (NLP) tasks [147]. Developed by OpenAI, GPT is an autoregressive transformer-based model that has displayed unparalleled performance in tasks such as text generation, translation, text summarization, and question-answering. The model’s architecture consists of multiple self-attention mechanisms and position-wise feedforward layers, enabling it to capture long-range dependencies and generate highly coherent and contextually relevant text. The key to GPT’s success lies in its unsupervised pre-training on vast amounts of textual data, followed by fine-tuning on specific tasks. As the GPT model series progresses, with GPT-3 being the latest version at the time of this writing, the size and capabilities of the model continue to grow, paving the way for increasingly sophisticated NLP applications and opening up new possibilities for the generation of synthetic data.
By leveraging its pre-training on massive datasets and fine-tuning for specific tasks, GPT can produce artificial data samples that closely resemble real-world data. This capability is particularly valuable in scenarios where access to real data is limited due to privacy, regulatory, or resource constraints. GPT-generated synthetic data can be used to augment existing
TABLE II
COMPARISON OF ALL THE GENERATIVE AI METHODS FROM DIFFERENT ASPECTS.
datasets, enabling researchers and practitioners to build more robust and accurate machine learning models while mitigating the risks associated with using sensitive or private data. Additionally, the synthetic data generated by GPT models can help address challenges related to data scarcity, class imbalance, or the need for domain-specific data, ultimately contributing to developing and deploying more effective AI solutions across various domains and applications.
IV. GENERATIVE AI
Generative AI models refer to a wide class of AI methods that can learn the data distribution from existing data objects and generate novel structured data objects, which fall into the category of unsupervised learning. Generative AI models, also known as deep generative models or distribution learning methods, learn the data distribution and sample from the learned distribution to produce novel data objects. In this section, we investigate several generative AI models that are frequently used in synthetic data generation, including the language model in Section IV-A, variational autoencoder (VAE) in Section IV-C, generative adversarial network (GAN) in Section IV-D, reinforcement learning (RL) in Section IV-E, and diffusion model in Section IV-F. Table II compares various generative AI methods from several aspects.
A. Language Model
The language model was originally designed to model natural language. It is able to learn structured knowledge from massive unlabelled sequence data. Specifically, suppose the sequence has $N$ tokens, denoted $X = [x_1, \cdots, x_N]$, then the probability distribution of the sequence can be decomposed as the product of a series of conditional probabilities,
$$p(X) = p([x_1, \cdots, x_N]) = \prod_{i=1}^{N} p(x_i \mid x_1, \cdots, x_{i-1}), \tag{4}$$
where a single conditional probability $p(x_i \mid x_1, \cdots, x_{i-1})$ denotes the probability of the token $x_i$ given all the tokens before $x_i$. The conditional probability can be modeled by the recurrent neural network (RNN). The language model can be used to generate all types of sequence data, such as natural language [148], electronic health records [154], etc. The language model can be combined with other deep learning models, such as variational autoencoder (VAE) and generative adversarial network (GAN), which will be described later.
B. Self-Supervised Learning (SSL)
Labeled data are expensive to acquire, so the amount of available labeled data is usually limited. To address this issue, self-supervised learning (SSL) was proposed. This learning paradigm curates the supervision signal from the data itself. It is parallel to supervised learning and unsupervised learning. Different from supervised learning, self-supervised learning can learn from massive unlabeled data. Self-supervised learning is usually used as a pretraining strategy to learn the representation from massive unlabelled data [155]. The core idea of self-supervised learning is to mask a subset of the raw data features and build a machine learning model to predict the masked data. Then the pre-trained machine learning model (usually a neural network) is used as a “warm start” and is further fine-tuned for the downstream applications.
C. Variational Autoencoder (VAE)
Variational autoencoder (VAE) [149] employs a continuous latent variable to characterize the data distribution. Specifically, it contains two neural network modules: encoder and decoder. The objective of the encoder is to convert the data object into a continuous latent variable. Then the decoder takes the latent variable as the input feature and reconstructs the data object.
Formally, suppose the data object is denoted $x$, the latent variable is a $d$-dimensional real-valued vector $z$, the encoder is $q(z|x)$, and the decoder is $p(x|z)$. The learning objective contains two parts: (1) reconstruct the data object $x$ and (2) encourage the distribution of latent variables to be close to the normal distribution.
The Kullback-Leibler (KL) divergence measures the difference between two probability distributions. Given two probability distributions $p_1(x)$ and $p_2(x)$ on the same continuous domain, the KL divergence between them is formally defined as
$$\mathrm{KL}(p_1 \| p_2) = \int_x p_1(x) \log \frac{p_1(x)}{p_2(x)}\, dx = \int_x p_1(x) \big[\log p_1(x) - \log p_2(x)\big]\, dx.$$
The VAE is trained by maximizing a lower bound on the log-likelihood of the data, the evidence lower bound (ELBO):
$$\log p(x) = \log \int_z p(z)\, p(x|z)\, dz \;\ge\; \mathbb{E}_{q(z|x)}\big[\log p(x|z)\big] - D_{\mathrm{KL}}\big(q(z|x) \,\|\, p(z)\big) \;\triangleq\; \mathrm{ELBO},$$
where $p(z)$ is the normal distribution and is used as the prior distribution. VAE encourages the distribution of latent
variables to be close to normal distribution. Then during the inference phase, we sample latent variables from the normal distribution and generate the novel data objects. There are several VAE variants, such as disentangled VAE [156], hierarchical VAE [157], and sequence VAE [76].
sequential decision-making as a Markov decision process (MDP) [153]. Markov decision process assumes that given the current state, the future state of the stochastic process does not depend on the historical states. Suppose the state at the time $t$ is $x_t$, Markov decision process satisfies
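The Markov property described in the preceding sentences is conventionally written as follows. This is the textbook form, included as an assumption about the relation the sentence leads into rather than a quotation of the authors' equation; $x_t$ is the state at time $t$, as defined in the text.

```latex
p(x_{t+1} \mid x_1, x_2, \dots, x_t) = p(x_{t+1} \mid x_t)
```

Read left to right: conditioning on the full history gives the same distribution over the next state as conditioning on the current state alone.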
The objective of diffusion models is to estimate the variational lower bound (VLB) of the negative log-likelihood of data distribution:
\[
\log p(x) \geq -\mathbb{E}_{q(x_{1:T} \mid x_0)} \left[ \log \frac{q(x_{1:T} \mid x_0)}{p_\theta(x_{0:T})} \right] = -L_{\mathrm{VLB}}.
\]
The VLB can be rewritten as:
\[
L_{\mathrm{VLB}} = \underbrace{\mathrm{KL}\big[ q(x_T \mid x_0) \,\|\, p_\theta(x_T) \big]}_{L_T} + \sum_{t=2}^{T} \underbrace{\mathrm{KL}\big[ q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t) \big]}_{L_{t-1}} \underbrace{- \mathbb{E}_q \big[ \log p_\theta(x_0 \mid x_1) \big]}_{L_0}.
\]
Here $L_T$ is a constant and can be ignored, and diffusion models [163] have been using a separate model for estimating $L_0$. For $\{L_{t-1}\}_{t=2}^{T}$, we model a neural network to approximate the conditionals during the reverse process, i.e., we want to train $\mu_\theta(x_t, t)$ to predict $\frac{1}{\sqrt{\alpha_t}} \big( x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \epsilon \big)$. If we plug this into the closed-form solution of the KL-divergence between two multivariate Gaussian distributions, we will have the following for $t = 1, \cdots, T - 1$:
\[
L_t = \mathbb{E}_{x_0, z} \left[ \left\| \epsilon_t - \epsilon_\theta\big( \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon_t,\ t \big) \right\|^2 \right].
\]
The diffusion model has achieved wide success in many downstream synthetic problems [80], [164]–[166]. In summary, Table II compares various generative AI methods from several aspects.

G. Multimodal Learning

Multimodal data refers to datasets that integrate multiple types of data, such as text, images, audio, and numerical values. This type of data provides a comprehensive view by combining different sources of information, which is crucial for tasks requiring a holistic understanding of complex scenarios. In fields like healthcare [167], finance, and autonomous systems, multimodal data enables more accurate and robust analysis and decision-making by leveraging the strengths of each data type. For instance, in drug discovery, multimodal data can combine genomic data, chemical structures, and clinical outcomes to enhance the prediction of drug efficacy and safety [93].

Synthetic multimodal data generation involves creating artificial datasets that integrate multiple types of data, such as text, images, audio, and numerical data, to simulate real-world scenarios. This technique is particularly valuable in fields like healthcare [16], finance [102], and education systems [107], where data is often complex and heterogeneous.

Then, we review some cutting-edge techniques for synthetic multimodal data generation. GANs can generate one type of data from another, such as images from textual descriptions or audio from images. This cross-modal generation capability is essential for creating cohesive multimodal datasets [11]. Recently, ChatGPT [127] supports multimodal data generation, including image, text, and numerical features.

V. PRIVACY RISKS AND PREVENTION

Open release and free data exchange would benefit research and industry development. However, there are cases where datasets exist but cannot be publicly disclosed due to privacy concerns. Regulated data, such as clinical and genomics data in raw form, may not be shared, and one solution is to share synthesized data instead.

A. Privacy Risks in Data Synthesis

Due to the utility goal of data synthesis, the synthesized data tends to preserve the distribution of the original data. Therefore, the deployment of these models could be subject to privacy leakage. For deep neural network-based approaches, membership inference attack [168], [169] would identify if an input is in the training data or not and thus can be used to determine how close the synthesized data is to the original data. At the feature level, sensitive attributes such as skin color can be inferred from the behavior of the deep learning model [170], and even the single training instance can be reconstructed [171]–[173]. For generative AI models, the generative learning process and the high complexity of the model jointly encourage a distribution that is concentrated around training samples. By repeatedly sampling from the distribution, there is a considerable chance of recovering the training samples or attributes [174]–[180], or the membership of the training data [181].

B. Privacy Protection in Data Synthesis

Solutions have been proposed in two broad categories. In the first category, different data anonymization-based approaches such as K-anonymity [217]–[219] and nearest marginal [220] sanitize data so that it cannot be easily re-identified. These data anonymization approaches involve replacing sensitive data with fictitious yet realistic data. They are often used to protect the data while maintaining its usability for testing or development purposes. However, they often do not provide rigorous privacy guarantees [14]. In the second category, synthetic data generation approaches have been proposed to generate realistic synthetic data using rigorous differential privacy definitions [112], [182], [221] for various applications. These approaches involve adding noise to the data to prevent the identification of individuals in the dataset while preserving the statistical properties of the data. This is particularly useful in scenarios where data needs to be shared but individual privacy must be maintained. In particular, Bindschaedler et al. [182] introduced the idea of plausible deniability instead of directly adding noise to the generative model. This mechanism results in input indistinguishability, meaning that by observing the output set (i.e., synthetics) an adversary cannot be sure whether a particular data record was in the input set (i.e., real data). With the help of generative modeling, Acs et al. [43] cluster the original datasets into k clusters with differentially private kernel k-means and produce synthetic data for each cluster. By comparison, Liu et al. [213] introduce two-level privacy-preserving synthetic data generation. At the data level, a selection module is used to select the items which contribute
TABLE III
SUMMARIZATION OF PRIVACY PREVENTION STRATEGIES IN SYNTHETIC DATA GENERATION.
less to the user’s preference. At the item level, a synthetic item generation module is developed to create the corresponding synthetic item.

Taking advantage of the GAN, several methods have been proposed to generate synthetic data [27], [49], [184], [185], [191], [195], [196] that match the distribution of the source data more closely than the hidden Markov model-based approach [71], the RBF-based approach [202], the Bayesian network-based approach [212], and the auto-encoder-based approach [14]. Xie et al. [185] propose DPGAN by adding noise on the gradient of the Wasserstein distance with respect to the training data. This approach does not adopt the optimization strategy to improve the training stability and convergence speed. To address these problems, Zhang et al. [184] proposed dp-GAN, a general private data publishing framework for rich semantic data without the requirement of tag information compared to [191]. By comparison, Beaulieu-Jones et al. [191] trained the discriminator under differentially private SGD, which generates plausible individuals of clinical datasets. Tseng and Wu [183] apply compressive privacy [222] for CPGAN, which would generate compressing representations that retain high utility. Jordon et al. [49] modify the Private Aggregation of Teacher Ensembles (PATE) framework and apply it to the discriminator of GANs. The proposed approach perceives the discriminator as a classifier and utilizes its output as knowledge such that the student learns from noisy labels that are obtained through privately aggregating the discriminators’ votes. This allows a tight bound on the influence of any individual sample on the model, resulting in tight differential privacy guarantees and thus an improved performance over models for data synthesis. By comparison, Long et al. [197] apply teacher-student-based differential privacy to the generator. While most of these approaches inject noise into the energy function, a differentially private GAN
called GANobfuscator [186] achieves differential privacy by adding noise within the training procedure.

While centralized differential privacy assumes data aggregators are reliable, local differential privacy (LDP) [221] assumes that aggregators cannot be trusted and relies on data providers to perturb their own data; it is used to generate private synthetic datasets that are similar to the private dataset. [215] is inspired by PriView [216] but for computing any k-way marginals under the LDP setting for the marginal table release problem. Furthermore, [198] considers DP at label-level on GAN for synthetic spatial point generation. Apart from LDP in distributed setting, Triastcyn and Faltings [187] propose federated generative privacy that utilizes insufficient local data from multiple clients to train a GAN. The method shares only generators that do not come directly into contact with data, and the discriminator remains private. This model can output artificial data, not belonging to any real user in particular, but coming from the common cross-user data distribution.

These privacy-preserving data synthesis methods mainly aim at structured data like tables and cannot readily be applied to data of high dimensionality and complexity. To solve this problem, PriView [216] constructs the private k-way marginal tables for k ≥ 3 by first extracting low-dimensional marginal views from the flat data and adding noise to the views and then applying a reprocessing technique to ensure the consistency of the noisy views. [223]–[226] leverage copula functions for multi-dimensional differentially private synthesization. Zhang et al. [206] consider repetitive perturbation of the original data as a substitute to the original data with a synthetic data generation technique called PrivBayes. PrivBayes decomposes high dimensional data into low dimensional marginals by constructing a Bayesian network and injects noise into these learned low dimensional marginals to ensure differential privacy, and the synthetic data is inferred from these noised marginals. Instead of the Bayesian network, differentially private auto-encoder [14] significantly improves the effectiveness of differentially private synthetic data release. [207] applies data cleaning method [227] to fix the violations on the structure of the data in the synthetic data. Instead of using graphical models as the summarization/representation of a dataset [14], [182], [184], [208], [209], [210] proposes to use a set of large number of low-degree marginals to represent a dataset. The advantage of this approach is that it makes weak assumptions about the conditional independence among attributes, and simply tries to capture correlation relationships that are in the dataset. Meanwhile, the method is especially attractive under differential privacy for its straightforward sensitivity measurement, reduced noise variance, and efficient privacy cost. [199] leverages the Hermite polynomial features to encapsulate a higher degree of information within a smaller order of feature. [201] constructs a graph that explores pairwise dependence between attributes and applies the junction tree algorithm to obtain the Markov random field (MRF), from which the noisy marginals are generated and the synthetic data are sampled.

While private synthetic data generation algorithms are agnostic to downstream tasks, it is important to meet the utility requirements for downstream use. [228] proposes post-processing via resampling from the synthetic data to filter out samples that do not meet the selected utility measures, thus improving the utility of synthetic data.

C. Privacy Threats in Foundation Models

Entering the era of foundation models, recent research has demonstrated that training data can be exposed from large language models [229] as well as stable diffusion [230]. In both types of models, attackers can generate sequences from the trained model and identify those memorized from the training set. Studies have shown that a sequence that appears multiple times in the training data is more likely to be generated than a sequence that occurred only once [166], [231], [232]. Accordingly, Kandpal et al. [233] propose to deduplicate the training data that appears multiple times such that the privacy risks in language models are mitigated. [234] is the first work to enforce privacy using differentially private stochastic gradient descent (DP-SGD) in diffusion models. Several attempts have been made to reduce the noise in the gradient during DP-SGD training and improve the generative quality in diffusion models, via semantic-aware pretraining [235], [236], latent information [237], and retrieval-augmented generation [238]. In the meantime, differential privacy has been heavily invested in privacy protection of large language models [239].

Given that we are still at the very early stage of the generative foundational models, the potential of the foundation models for data synthesis has not been fully explored. While more possible privacy threats on the foundation models are yet to be discovered, existing privacy measures may be inadequate to meet their privacy demands. Further investigation is needed to design countermeasures that would mitigate the memorization and generalization problems for privacy protection.

VI. FAIRNESS

Generating synthetic data that reflect the important underlying statistical properties of the real-world data may also inherit the bias from data preprocessing, collection, and algorithms [240]. Minority groups can often end up being underrepresented in synthetic data [241]–[243]. The fairness problem is currently addressed by three types of methods [244]: (i) preprocessing, which revises input data to remove information correlated to sensitive attributes, usually via techniques like massaging, reweighting, and sampling; (ii) in-processing, which adds fairness constraints to the model learning process; and (iii) post-processing, which adjusts model predictions after the model is trained.

Most existing fairness-aware data synthesis methods leverage preprocessing techniques. The use of balanced synthetic datasets created by GANs to augment classification training has demonstrated the benefits for reducing disparate impact due to minoritized subgroup imbalance [245]–[247]. [248] models bias using a probabilistic network exploiting structural equation modeling as the preprocessing to generate a fairness-aware synthetic dataset. Authors in [249] leverage GAN as the pre-processing for fair data generation that ensures the generated data is discrimination free while maintaining high data utility. By comparison, [250] is geared towards high
dimensional image data and proposes a novel auxiliary classifier GAN that strives for demographic parity or equality of opportunity. However, preprocessing would require the synthesized data provider to know all correlations, biases, and distributions of variables in the existing datasets a priori. Compared to preprocessing, the latter two categories are less developed for fair data synthesis. [251] inserts a structural causal model in the input layers of the generator, allowing each variable to be reconstructed conditioned on its causal parents for inference time debiasing.

In the meantime, differential privacy amplifies the fairness issues in the original data [252]. [129] demonstrates that differential privacy does not introduce unfairness into the data generation process or to standard group fairness measures in the downstream classification models, but does unfairly increase the influence of majority subgroups. Differential privacy also significantly reduces the quality of the images generated from the GANs, decreasing the synthetic data’s utility in downstream tasks. To measure the fairness in synthesized data, [92] develops two covariate-level disparity fairness metrics for synthetic data. The authors analyze all subgroups defined by protected attributes to analyze the bias.

In the emerging AIGC using foundation models, the generated images and texts may also inherit the stereotypes, exclusion and marginalization of certain groups and toxic and offensive information in the real-world data. This would lead to discrimination and harm to certain social groups. The misuse of such data synthesis approaches by misinformation and manipulation would lead to further negative social impact [253]. Given that the quality of the data generated by foundation models is inextricably linked to the quality of the training corpora, it is essential to regulate the real-world data being used to form the data synthesis distribution. While reducing bias in data is important, the remaining bias in the data may also be amplified by the models [244] or the privacy-enhancing components [252]. Frequent inspection and removal of sensitive and toxic information on both data and model will help govern the information generated from those foundation models and ensure the models would do no harm.

VII. EVALUATION STRATEGY

In this section, we discuss various approaches to evaluating the quality of synthesized data, which is essential for determining the effectiveness and applicability of synthetic data generation methods in real-world scenarios. We categorize these evaluation strategies as follows:

1) Human evaluation. This method is the most direct way to assess the quality of synthesized data. Human evaluation involves soliciting opinions from domain experts or non-expert users to judge the synthesized data’s quality, similarity to real data, or usability in specific applications. For example, in speech synthesis, the human evaluator rates the synthesized speech and real human speech in a blind manner [44], [254]. However, human evaluation has several drawbacks, including being expensive, time-consuming, error-prone, and not scalable. Additionally, it struggles with high-dimensional data that cannot be easily visualized and evaluated by humans.

2) Statistical difference evaluation. This strategy involves calculating various statistical metrics on both the synthesized and real datasets and comparing the results. For example, [53], [255] use first-moment statistics of individual features (e.g., medical concept frequency/correlation, patient-level clinical feature) to evaluate the quality of generated electronic health record (EHR) data. The smaller the differences between the statistical properties of synthetic and real data, the better the quality of the synthesized data.

3) Evaluation using a pre-trained machine learning model. As mentioned in Section IV-D, in the generative adversarial network (GAN), the discriminator differentiates fake data (synthesized data) from real ones. Consequently, the output of the discriminator can measure how closely synthetic data resembles real data. The performance of the discriminator on the synthesized data can be used as an indicator of how well the generator produces realistic data. This strategy can be applied not only to GANs but also to other generative models where a pre-trained machine learning model is used for evaluation.

4) Training on synthetic dataset and testing on the real dataset (TSTR). This strategy involves using synthetic data to train machine learning models and assessing their prediction performance on real test data in downstream applications. High performance on real test data indicates that the synthetic data has successfully captured essential characteristics of the real data, making it a useful proxy for training. For example, [256] employs synthetic data to train machine learning models and assess their prediction performance on real test data in downstream applications. TSTR can provide insights into the effectiveness of synthetic data for training machine learning models in a wide range of tasks and domains.

5) Application-specific evaluation. Depending on the specific use case or domain, tailored evaluation methods may be employed to assess the quality of synthesized data. These evaluation methods can consider the unique requirements or constraints of the application, such as regulatory compliance, privacy concerns, or specific performance metrics. By evaluating the synthesized data in the context of its intended use, a more accurate assessment of its quality and applicability can be obtained.

These evaluation strategies offer various ways to gauge the quality of synthesized data, helping researchers and practitioners determine the effectiveness of synthetic data generation methods and their applicability in real-world scenarios. Employing a combination of these strategies can provide a more comprehensive understanding of the strengths and weaknesses of the synthesized data, facilitating further improvements in synthetic data generation techniques [257].
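As an illustration of the statistical difference evaluation (strategy 2), the following minimal sketch compares the first moments (per-feature means) of a real and a synthetic dataset. It assumes both datasets are numeric arrays with the same columns; the function name `first_moment_gap` and the toy Bernoulli features are hypothetical choices made for this example and are not taken from [53], [255].

```python
import numpy as np


def first_moment_gap(real: np.ndarray, synthetic: np.ndarray) -> dict:
    """Compare per-feature means (first moments) of real and synthetic data."""
    if real.shape[1] != synthetic.shape[1]:
        raise ValueError("real and synthetic data must share the same features")
    # Absolute difference between the feature-wise means of the two datasets.
    abs_gap = np.abs(real.mean(axis=0) - synthetic.mean(axis=0))
    return {
        "per_feature_gap": abs_gap,
        "mean_abs_gap": float(abs_gap.mean()),
        "max_abs_gap": float(abs_gap.max()),
    }


# Toy data (an assumption): three binary features, e.g. presence of a medical concept.
rng = np.random.default_rng(0)
real = rng.binomial(1, [0.1, 0.5, 0.9], size=(5000, 3))
good = rng.binomial(1, [0.1, 0.5, 0.9], size=(5000, 3))  # same feature frequencies
bad = rng.binomial(1, [0.5, 0.5, 0.5], size=(5000, 3))   # mismatched frequencies

print("gap (well-matched synthetic):", first_moment_gap(real, good)["mean_abs_gap"])
print("gap (mismatched synthetic):  ", first_moment_gap(real, bad)["mean_abs_gap"])
```

A smaller gap suggests that the synthetic data tracks the real feature frequencies more closely. First moments alone do not capture correlations between features, so in practice such a check would likely be combined with higher-order statistics.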
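The TSTR strategy (train on synthetic, test on real) can be sketched in the same spirit. The example below is a minimal sketch under stated assumptions: a nearest-centroid classifier stands in for the downstream model, and the two-class Gaussian data, along with the helper names `fit_centroids`, `predict`, `tstr_accuracy`, and `make_blobs`, are hypothetical and not part of [256].

```python
import numpy as np


def fit_centroids(X, y):
    """Fit a nearest-centroid classifier: one mean vector per class."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def predict(model, X):
    """Assign each sample to the class with the closest centroid."""
    classes, centroids = model
    # Squared Euclidean distance from every sample to every centroid.
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]


def tstr_accuracy(X_syn, y_syn, X_real_test, y_real_test):
    """Train on synthetic data, then report accuracy on real test data."""
    model = fit_centroids(X_syn, y_syn)
    return float((predict(model, X_real_test) == y_real_test).mean())


def make_blobs(rng, n, shift):
    """Toy two-class data: class 1 is class 0 shifted by `shift` per coordinate."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 2)) + shift * y[:, None]
    return X, y


rng = np.random.default_rng(0)
X_real, y_real = make_blobs(rng, 2000, shift=3.0)
X_syn_good, y_syn_good = make_blobs(rng, 2000, shift=3.0)  # mimics the real distribution
X_syn_bad, y_syn_bad = make_blobs(rng, 2000, shift=-3.0)   # class structure is distorted

acc_good = tstr_accuracy(X_syn_good, y_syn_good, X_real, y_real)
acc_bad = tstr_accuracy(X_syn_bad, y_syn_bad, X_real, y_real)
print("TSTR accuracy, faithful synthetic data: ", acc_good)
print("TSTR accuracy, distorted synthetic data:", acc_bad)
```

A model trained on faithful synthetic data should transfer to the real test set, whereas a large drop in accuracy signals that the synthetic data missed class-relevant structure. Any classifier could replace the nearest-centroid model here; it was chosen only to keep the sketch dependency-free.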
VIII. CHALLENGES AND OPPORTUNITIES

The aim of this research is to present a comprehensive survey of synthetic data generation, a promising and emerging technique in contemporary deep learning. This survey outlines current real-world applications and identifies potential avenues for future research in this field. The utilization of synthetic data has been proven effective across a diverse array of tasks and domains [9]. In this section, we delve into the challenges and opportunities presented by this rapidly evolving area.

First and foremost, evaluation metrics for synthetic data are essential to determine the reasonableness of the generated data. In industries like healthcare, where data quality is of paramount importance, clinical quality measures and evaluation metrics are not always readily available for synthetic data. Clinicians often struggle to interpret existing criteria such as probability likelihood and divergence scores when assessing generative models [68]. Concurrently, there is a pressing need to develop and adopt specific regulations for the use of synthetic data in medicine and healthcare, ensuring that the generated data meets the required quality standards while minimizing potential risks.

Secondly, due to limited attention and the challenges associated with covering various domains using synthetic data, current methods might not account for all outliers and corner cases present in the original data. Investigating outliers and regular instances and their impact on the parameterization of existing methods could be a valuable research direction [258]. To enhance future detection methods, it may be beneficial to examine the gap between the performance of detection methods and a well-designed evaluation matrix, which could provide insights into areas that require improvement.

Thirdly, synthetic data generation may involve underlying models with inherent biases, which might not be immediately evident [92]. Factors such as sample selection biases and class imbalances can contribute to these issues. Typically, algorithms trained with biases in sample selection may underperform when deployed in settings that deviate significantly from the conditions in which the data was collected [68]. Thus, it is crucial to develop methods and strategies that address these biases, ensuring that synthetic data generation leads to more accurate and reliable results across diverse applications and domains.

Last but not least, the rise of foundation models in data synthesis presents both significant challenges and opportunities. On one hand, foundation models can be exploited by malicious actors to create sophisticated jailbreak attacks, deepfakes, discrimination, exclusion and toxicity problems, misinformation harms, sensitive information disclosure, and malicious use. These models can generate human-like text and realistic images or videos, making it difficult for traditional security measures to detect malicious content. Furthermore, the accessibility and rapid advancement of these technologies lower the barrier for cybercriminals, enabling more sophisticated and widespread attacks. The ability to generate vast amounts of realistic, yet fake, data can also overwhelm and deceive traditional detection systems, leading to an increase in false negatives and undetected breaches. On the other hand, foundation models offer promising opportunities to bolster cybersecurity defenses. AI-driven anomaly detection systems can leverage generative models to simulate various attack scenarios, improving their ability to recognize and mitigate real-world threats. In the meantime, the quest for transparency and interpretability in generative models promotes research into explainable AI. By proactively addressing these machine learning risks, synthetic data generation can evolve to deliver more ethical, secure, and transparent solutions, ultimately harnessing its full potential to benefit society while mitigating its associated risks.

In general, the use of synthetic data is becoming a viable alternative to training models with real data due to advances in simulations and generative models. However, a number of open challenges need to be overcome to achieve high performance. These include the lack of standard tools, the difference between synthetic and real data, and how much machine learning algorithms can do to exploit imperfect synthetic data effectively. Though this emerging approach is not perfect now, with models, metrics, and technologies maturing, we believe synthetic data generation will make a bigger impact in the future.

IX. CONCLUSION

In conclusion, machine learning has revolutionized various industries by enabling intelligent computer systems to autonomously tackle tasks, manage and analyze massive volumes of data. However, it still faces several challenges, including data quality, data scarcity, and data governance. These challenges can be addressed through synthetic data generation, which involves the artificial annotation of information generated by computer algorithms or simulations. Synthetic data has been extensively utilized in various sectors due to its ability to bridge gaps, especially when real data is either unavailable or must be kept private due to privacy or compliance risks.

This paper has provided a high-level overview of several state-of-the-art approaches currently being investigated by machine learning researchers for synthetic data generation. We have explored different real-world application domains, and examined a diverse array of deep neural network architectures and deep generative models dedicated to generating high-quality synthetic data.

To sum up, synthetic data generation has enormous potential for unlocking the full potential of machine learning and its impact on various industries. While challenges persist in the development and application of machine learning technology, synthetic data generation provides a promising solution that can help address these obstacles. Future research can further enhance the functionality of synthetic data generation.

REFERENCES

[1] A. Ng, “What artificial intelligence can and can’t do right now,” Harvard Business Review, vol. 9, no. 11, 2016.
[2] M. A. Boden, Artificial intelligence. Elsevier, 1996.
[3] M. Haenlein and A. Kaplan, “A brief history of artificial intelligence: On the past, present, and future of artificial intelligence,” California management review, vol. 61, no. 4, pp. 5–14, 2019.
[4] F. Lucini, “The real deal about synthetic data,” MIT Sloan Management Review, vol. 63, no. 1, pp. 1–4, 2021.
[5] M. I. Jordan and T. M. Mitchell, “Machine learning: Trends, perspectives, and prospects,” Science, vol. 349, no. 6245, pp. 255–260, 2015.
[6] L. L. Pipino, Y. W. Lee, and R. Y. Wang, “Data quality assessment,” Communications of the ACM, vol. 45, no. 4, pp. 211–218, 2002.
[7] M. Shen, Y.-T. Chang, C.-T. Wu, S. J. Parker, G. Saylor, Y. Wang, G. Yu, J. E. Van Eyk, R. Clarke, D. M. Herrington et al., “Comparative assessment and novel strategy on methods for imputing proteomics data,” Scientific reports, vol. 12, no. 1, p. 1067, 2022.
[8] R. Babbar and B. Schölkopf, “Data scarcity, robustness and extreme multi-label classification,” Machine Learning, vol. 108, no. 8, pp. 1329–1351, 2019.
[9] S. I. Nikolenko, Synthetic data for deep learning. Springer, 2021, vol. 174.
[10] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, “A review of feature selection methods on synthetic data,” Knowledge and information systems, vol. 34, no. 3, pp. 483–519, 2013.
[11] M. Frid-Adar, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan, “Synthetic data augmentation using gan for improved liver lesion classification,” in IEEE international symposium on biomedical imaging (ISBI), 2018.
[12] Q. Wang, J. Gao, W. Lin, and Y. Yuan, “Learning from synthetic data for crowd counting in the wild,” in IEEE/CVF conference on computer vision and pattern recognition, 2019.
[13] J. M. Abowd and L. Vilhuber, “How protective are synthetic data?” in International Conference on Privacy in Statistical Databases. Springer, 2008.
[14] N. C. Abay, Y. Zhou, M. Kantarcioglu, B. Thuraisingham, and L. Sweeney, “Privacy preserving synthetic data release using deep learning,” in Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2019.
[15] T. E. Raghunathan, “Synthetic data,” Annual Review of Statistics and Its Application, vol. 8, pp. 129–140, 2021.
[16] E. Choi, S. Biswal, B. Malin, J. Duke, W. F. Stewart, and J. Sun, “Generating multi-label discrete patient records using generative adversarial networks,” in Machine learning for healthcare conference. PMLR, 2017.
[17] J. D. Ziegler, S. Subramaniam, M. Azzarito, O. Doyle, P. Krusche, and T. Coroller, “Multi-modal conditional GAN: Data synthesis in the medical domain,” in NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
[18] K. W. Dunn, C. Fu, D. J. Ho, S. Lee, S. Han, P. Salama, and E. J. Delp, “DeepSynth: Three-dimensional nuclear segmentation of biological images using neural networks trained with synthetic data,” Scientific reports, vol. 9, no. 1, pp. 1–15, 2019.
[19] Y. Du, X. Liu, N. Shah, S. Liu, J. Zhang, and B. Zhou, “Chemspace: Interpretable and interactive chemical space exploration,” 2022.
[20] T. Sterling and J. J. Irwin, “Zinc 15–ligand discovery for everyone,” Journal of chemical information and modeling, vol. 55, no. 11, pp. 2324–2337, 2015.
[21] W. Jin, R. Barzilay, and T. S. Jaakkola, “Junction tree variational autoencoder for molecular graph generation,” in International Conference on Machine Learning, 2018.
[22] M. Olivecrona, T. Blaschke, O. Engkvist, and H. Chen, “Molecular de-novo design through deep reinforcement learning,” Journal of cheminformatics, vol. 9, no. 1, p. 48, 2017.
data,” in NeurIPS 2022 Workshop on Synthetic Data for Empowering ML Research, 2022.
[29] B. Nowok, G. M. Raab, and C. Dibben, “synthpop: Bespoke creation of synthetic data in R,” Journal of statistical software, vol. 74, pp. 1–26, 2016.
[30] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, “Unpaired image-to-image translation using cycle-consistent adversarial networks,” in IEEE international conference on computer vision, 2017.
[31] R. Torkzadehmahani, P. Kairouz, and B. Paten, “Dp-cgan: Differentially private synthetic data and label generation,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[32] A. Brock, J. Donahue, and K. Simonyan, “Large scale GAN training for high fidelity natural image synthesis,” arXiv preprint arXiv:1809.11096, 2018.
[33] J. Ho, T. Salimans, A. Gritsenko, W. Chan, M. Norouzi, and D. J. Fleet, “Video diffusion models,” arXiv preprint arXiv:2204.03458, 2022.
[34] A. Razavi, A. Van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with vq-vae-2,” Advances in neural information processing systems, vol. 32, 2019.
[35] M. Niemeyer and A. Geiger, “Giraffe: Representing scenes as compositional generative neural feature fields,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
[36] N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “Wavegrad: Estimating gradients for waveform generation,” International Conference on Learning Representations (ICLR), 2021.
[37] H. Guo, F. K. Soong, L. He, and L. Xie, “A new GAN-based end-to-end TTS training algorithm,” arXiv preprint arXiv:1904.04775, 2019.
[38] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient,” in AAAI conference on artificial intelligence, vol. 31, no. 1, 2017.
[39] T. Sellam, D. Das, and A. P. Parikh, “Bleurt: Learning robust metrics for text generation,” arXiv preprint arXiv:2004.04696, 2020.
[40] Z. Shi, X. Chen, X. Qiu, and X. Huang, “Toward diverse text generation with inverse reinforcement learning,” in International Joint Conference on Artificial Intelligence, 2018.
[41] C.-Y. Ko, P.-Y. Chen, J. Mohapatra, P. Das, and L. Daniel, “Synbench: Task-agnostic benchmarking of pretrained representations using synthetic data,” arXiv preprint arXiv:2210.02989, 2022.
[42] W. Nie, N. Narodytska, and A. Patel, “Relgan: Relational generative adversarial networks for text generation,” in International conference on learning representations, 2018.
[43] G. Acs, L. Melis, C. Castelluccia, and E. De Cristofaro, “Differentially private mixture of generative neural networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 31, no. 6, pp. 1109–1121, 2018.
[44] C. Donahue, J. McAuley, and M. Puckette, “Adversarial audio synthesis,” arXiv preprint arXiv:1802.04208, 2018.
[45] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499, 2016.
[46] X. Zhang, I. Vallés-Pérez, A. Stolcke, C. Yu, J. Droppo, O. Shonibare, R. Barra-Chicote, and V. Ravichandran, “Stutter-tts: Controlled synthesis and improved recognition of stuttered speech,” arXiv preprint arXiv:2211.09731, 2022.
[47] M. Wiese, R. Knobloch, R. Korn, and P. Kretschmer, “Quant GANs: deep generation of financial time series,” Quantitative Finance, vol. 20, no. 9, pp. 1419–1440, 2020.
[23] T. Fu, C. Xiao, and J. Sun, “CORE: Automatic molecule optimization [48] R. Fu, J. Chen, S. Zeng, Y. Zhuang, and A. Sudjianto, “Time series
using copy and refine strategy,” AAAI conference on artificial intelli- simulation by conditional generative adversarial net,” arXiv preprint
gence, 2020. arXiv:1904.11419, 2019.
[24] T. Fu, W. Gao, C. W. Coley, and J. Sun, “Reinforced genetic algorithm [49] J. Jordon, J. Yoon, and M. Van Der Schaar, “Pate-gan: Generating
for structure-based drug design,” in Advances in Neural Information synthetic data with differential privacy guarantees,” in International
Processing Systems (NeurIPS), 2022. conference on learning representations, 2018.
[25] K. Huang, T. Fu, W. Gao, Y. Zhao, Y. Roohani, J. Leskovec, C. W. [50] A. Collaboration et al., “Deep generative models for fast photon shower
Coley, C. Xiao, J. Sun, and M. Zitnik, “Artificial intelligence foundation simulation in atlas,” arXiv preprint arXiv:2210.06204, 2022.
for therapeutic science,” Nature Chemical Biology, pp. 1–4, 2022. [51] C. Dewi, R.-C. Chen, Y.-T. Liu, and S.-K. Tai, “Synthetic data
[26] A. Torfi and E. A. Fox, “Corgan: Correlation-capturing convolutional generation using dcgan for improved traffic sign recognition,” Neural
generative adversarial networks for generating synthetic healthcare Computing and Applications, vol. 34, no. 24, pp. 21 465–21 480, 2022.
records,” in International Flairs Conference, 2020. [52] Z. Zhao, K. Xu, S. Li, Z. Zeng, and C. Guan, “Mt-uda: Towards
[27] D. Lee, H. Yu, X. Jiang, D. Rogith, M. Gudala, M. Tejani, Q. Zhang, unsupervised cross-modality medical image segmentation with limited
and L. Xiong, “Generating sequential electronic health records using source labels,” in Medical Image Computing and Computer Assisted
dual adversarial autoencoder,” Journal of the American Medical Infor- Intervention (MICCAI). Springer, 2021.
matics Association, vol. 27, no. 9, pp. 1411–1419, 2020. [53] S. Yi, M. Lu, A. Yee, J. Harmon, F. Meng, and S. Hinduja, “Enhance
[28] S. Wharrie, Z. Yang, V. Raj, R. Monti, R. Gupta, Y. Wang, A. Martin, wound healing monitoring through a thermal imaging based smart-
L. J. O’Connor, S. Kaski, P. Marttinen et al., “HAPNEST: an efficient phone app,” in Medical Imaging: Imaging Informatics for Healthcare,
tool for generating large-scale genetics datasets from limited training Research, and Applications. SPIE, 2018.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 16
[54] Y. Chen, W. Li, X. Chen, and L. V. Gool, “Learning semantic design using a data-driven continuous representation of molecules,”
segmentation from synthetic data: A geometrically guided input-output ACS central science, vol. 4, no. 2, pp. 268–276, 2018.
adaptation approach,” in IEEE/CVF Conference on Computer Vision [77] B. Zhang, Y. Fu, Y. Lu, Z. Zhang, R. Clarke, J. E. Van Eyk,
and Pattern Recognition, 2019. D. M. Herrington, and Y. Wang, “DDN2.0: R and python packages
[55] S. Sankaranarayanan, Y. Balaji, A. Jain, S. N. Lim, and R. Chellappa, for differential dependency network analysis of biological systems,”
“Learning from synthetic data: Addressing domain shift for semantic bioRxiv, pp. 2021–04, 2021.
segmentation,” in IEEE/CVF conference on computer vision and pat- [78] N. De Cao and T. Kipf, “MolGAN: An implicit generative model for
tern recognition, 2018. small molecular graphs,” arXiv preprint arXiv:1805.11973, 2018.
[56] H.-W. Dong and Y.-H. Yang, “Towards a deeper understanding of [79] T. Fu and J. Sun, “Antibody Complementarity Determining Regions
adversarial losses,” arXiv preprint arXiv:1901.08753, 2019. (CDRs) design using constrained energy model,” in ACM SIGKDD
[57] E. Wood, T. Baltrušaitis, C. Hewitt, S. Dziadzio, T. J. Cashman, Conference on Knowledge Discovery and Data Mining, 2022.
and J. Shotton, “Fake it till you make it: face analysis in the wild [80] T. Fu, W. Gao, C. Xiao, J. Yasonik, C. W. Coley, and J. Sun, “Dif-
using synthetic data alone,” in IEEE/CVF international conference on ferentiable scaffolding tree for molecular optimization,” International
computer vision, 2021. Conference on Learning Representations, 2022.
[58] A. Werchniak, R. B. Chicote, Y. Mishchenko, J. Droppo, J. Condal, [81] M. Xu, L. Yu, Y. Song, C. Shi, S. Ermon, and J. Tang, “GeoDiff: A
P. Liu, and A. Shah, “Exploring the application of synthetic audio geometric diffusion model for molecular conformation generation,” in
in training keyword spotters,” in IEEE International Conference on International Conference on Learning Representations, 2021.
Acoustics, Speech and Signal Processing (ICASSP), 2021. [82] Z. Zhou, S. Kearnes, L. Li, R. N. Zare, and P. Riley, “Optimization of
[59] W. Li, H. You, J. Zhu, and N. Chen, “Feature sparsity analysis for molecules via deep reinforcement learning,” Scientific reports, vol. 9,
i-vector based speaker verification,” Speech Communication, vol. 80, no. 1, pp. 1–10, 2019.
pp. 60–70, 2016. [83] J. H. Jensen, “A graph-based genetic algorithm and generative
[60] Y. Qian, Y. Liu, and K. Yu, “Tandem deep features for text-dependent model/monte carlo tree search for the exploration of chemical space,”
speaker verification,” in Fifteenth Annual Conference of the Interna- Chemical science, vol. 10, no. 12, pp. 3567–3572, 2019.
tional Speech Communication Association, 2014. [84] T. Fu, C. Xiao, X. Li, L. M. Glass, and J. Sun, “MIMOSA: Multi-
[61] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using constraint molecule sampling for molecule optimization,” in AAAI
restricted boltzmann machines and deep belief networks for statistical Conference on Artificial Intelligence, 2021.
parametric speech synthesis,” IEEE transactions on audio, speech, and [85] T. Fu and J. Sun, “SIPF: Sampling method for inverse protein folding,”
language processing, vol. 21, no. 10, pp. 2129–2139, 2013. in ACM SIGKDD Conference on Knowledge Discovery and Data
[62] A. Fazel, W. Yang, Y. Liu, R. Barra-Chicote, Y. Meng, R. Maas, and Mining, 2022.
J. Droppo, “Synthasr: Unlocking synthetic data for speech recognition,” [86] C. S. Kruse, B. Smith, H. Vanderlinden, and A. Nealand, “Security
arXiv preprint arXiv:2106.07803, 2021. techniques for the electronic health records,” Journal of medical
[63] W. Li and J. Zhu, “An improved i-vector extraction algorithm for systems, vol. 41, no. 8, pp. 1–9, 2017.
speaker verification,” EURASIP Journal on Audio, Speech, and Music [87] Q. Wen, Z. Ouyang, J. Zhang, Y. Qian, Y. Ye, and C. Zhang, “Dis-
Processing, vol. 2015, pp. 1–9, 2015. entangled dynamic heterogeneous graph learning for opioid overdose
[64] G. Forman, “An extensive empirical study of feature selection metrics prediction,” in ACM SIGKDD Conference on Knowledge Discovery
for text classification,” Journal of Machine Learning Research, vol. 3, and Data Mining, 2022.
pp. 1289–1305, 2003. [88] T. Fu, T. Gao, C. Xiao, T. Ma, and J. Sun, “Pearl: Prototype learning
[65] X. Yue, H. A. Inan, X. Li, G. Kumar, J. McAnallen, H. Sun, D. Levitan, via rule learning,” in ACM International Conference on Bioinformatics,
and R. Sim, “Synthetic text generation with differential privacy: A Computational Biology and Health Informatics, 2019, pp. 223–232.
simple and practical recipe,” arXiv preprint arXiv:2210.14348, 2022. [89] A. Goncalves, P. Ray, B. Soper, J. Stevens, L. Coyle, and A. P. Sales,
[66] X. Zheng, Y. Liu, D. Gunceler, and D. Willett, “Using synthetic audio “Generation and evaluation of synthetic patient data,” BMC medical
to improve the recognition of out-of-vocabulary words in end-to-end research methodology, vol. 20, no. 1, pp. 1–40, 2020.
asr systems,” in IEEE International Conference on Acoustics, Speech [90] D. Du, S. Bhardwaj, S. J. Parker, Z. Cheng, Z. Zhang, Y. Lu, J. E.
and Signal Processing (ICASSP), 2021. Van Eyk, G. Yu, R. Clarke, D. M. Herrington et al., “Abds: tool suite for
[67] Z. Zhao, A. Zhu, Z. Zeng, B. Veeravalli, and C. Guan, “Act-net: analyzing biologically diverse samples,” bioRxiv, pp. 2023–07, 2023.
Asymmetric co-teacher network for semi-supervised memory-efficient [91] Y. Lu, “Multi-omics data integration for identifying disease specific
medical image segmentation,” in IEEE International Conference on biological pathways,” Ph.D. dissertation, Virginia Tech, 2018.
Image Processing (ICIP). IEEE, 2022. [92] K. Bhanot, M. Qi, J. S. Erickson, I. Guyon, and K. P. Bennett, “The
[68] R. J. Chen, M. Y. Lu, T. Y. Chen, D. F. Williamson, and F. Mahmood, problem of fairness in synthetic healthcare data,” Entropy, vol. 23,
“Synthetic data in machine learning for medicine and healthcare,” no. 9, p. 1165, 2021.
Nature Biomedical Engineering, vol. 5, no. 6, pp. 493–497, 2021. [93] T. Fu, K. Huang, C. Xiao, L. M. Glass, and J. Sun, “HINT: Hierarchical
[69] A. Tucker, Z. Wang, Y. Rotalinti, and P. Myles, “Generating high- interaction network for clinical-trial-outcome predictions,” Patterns,
fidelity synthetic patient data for assessing machine learning healthcare vol. 3, no. 4, p. 100445, 2022.
software,” NPJ digital medicine, vol. 3, no. 1, pp. 1–13, 2020. [94] T. Fu, T. N. Hoang, C. Xiao, and J. Sun, “DDL: Deep dictionary
[70] Y. Lu, C.-T. Wu, S. J. Parker, Z. Cheng, G. Saylor, J. E. Van Eyk, learning for predictive phenotyping,” in International Joint Conference
G. Yu, R. Clarke, D. M. Herrington, and Y. Wang, “Cot: an efficient and on Artificial Intelligence, 2019.
accurate method for detecting marker genes among many subtypes,” [95] L. Chen, Y. Lu, C.-T. Wu, R. Clarke, G. Yu, J. E. Van Eyk, D. M.
Bioinformatics Advances, vol. 2, no. 1, p. vbac037, 2022. Herrington, and Y. Wang, “Data-driven detection of subtype-specific
[71] J. Dahmen and D. Cook, “Synsys: A synthetic data generation system differentially expressed genes,” Scientific reports, vol. 11, no. 1, pp.
for healthcare applications,” Sensors, vol. 19, no. 5, p. 1181, 2019. 1–12, 2021.
[72] Y. Lu, Y.-T. Chang, E. P. Hoffman, G. Yu, D. M. Herrington, R. Clarke, [96] P. Eigenschink, S. Vamosi, R. Vamosi, C. Sun, T. Reutterer, and
C.-T. Wu, L. Chen, and Y. Wang, “Integrated identification of disease K. Kalcher, “Deep generative models for synthetic data,” ACM Com-
specific pathways using multi-omics data,” bioRxiv, p. 666065, 2019. puting Surveys, 2021.
[73] Z. Wang, P. Myles, and A. Tucker, “Generating and evaluating cross- [97] C.-T. Wu, M. Shen, D. Du, Z. Cheng, S. J. Parker, Y. Lu, J. E. Van Eyk,
sectional synthetic electronic healthcare data: Preserving data utility G. Yu, R. Clarke, D. M. Herrington et al., “Cosbin: cosine score-based
and patient privacy,” Computational Intelligence, vol. 37, no. 2, pp. iterative normalization of biologically diverse samples,” Bioinformatics
819–851, 2021. Advances, vol. 2, no. 1, p. vbac076, 2022.
[74] R. S. Bohacek, C. McMartin, and W. C. Guida, “The art and practice [98] R. Wang and X. Qu, “Eeg daydreaming, a machine learning approach to
of structure-based drug design: a molecular modeling perspective,” detect daydreaming activities,” in Augmented Cognition: International
Medicinal research reviews, vol. 16, no. 1, pp. 3–50, 1996. Conference. Springer, 2022.
[75] K. Huang, T. Fu, L. M. Glass, M. Zitnik, C. Xiao, and J. Sun, [99] Y. Du, T. Fu, J. Sun, and S. Liu, “Molgensurvey: A systematic
“DeepPurpose: a deep learning library for drug–target interaction survey in machine learning models for molecule design,” arXiv preprint
prediction,” Bioinformatics, vol. 36, no. 22-23, pp. 5545–5547, 2020. arXiv:2203.14500, 2022.
[76] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández- [100] K. El Emam, L. Mosquera, and R. Hoptroff, Practical synthetic data
Lobato, B. Sánchez-Lengeling, D. Sheberla, J. Aguilera-Iparraguirre, generation: balancing privacy and the broad availability of data.
T. D. Hirzel, R. P. Adams, and A. Aspuru-Guzik, “Automatic chemical O’Reilly Media, 2020.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 17
[101] M. Mannino and A. Abouzied, “Is this real? generating synthetic data [124] Y. Cao, S. Li, Y. Liu, Z. Yan, Y. Dai, P. S. Yu, and L. Sun, “A
that looks real,” in ACM Symposium on User Interface Software and comprehensive survey of ai-generated content (aigc): A history of
Technology, 2019. generative ai from gan to chatgpt,” arXiv preprint arXiv:2303.04226,
[102] S. A. Assefa, D. Dervovic, M. Mahfouz, R. E. Tillman, P. Reddy, 2023.
and M. Veloso, “Generating synthetic data in finance: opportunities, [125] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N.
challenges and pitfalls,” in ACM International Conference on AI in Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,”
Finance, 2020. Advances in neural information processing systems, 2017.
[103] P.-H. Lu, P.-C. Wang, and C.-M. Yu, “Empirical evaluation on synthetic [126] W. Tao, S. Gao, and Y. Yuan, “Boundary crossing: an experimental
data generation with generative adversarial network,” in International study of individual perceptions toward aigc,” Frontiers in Psychology,
Conference on Web Intelligence, Mining and Semantics, 2019. vol. 14, 2023.
[104] M. Hittmeir, A. Ekelhart, and R. Mayer, “On the utility of synthetic [127] R. J. M. Ventayen, “Openai chatgpt generated results: Similarity index
data: An empirical evaluation on machine learning tasks,” in Interna- of artificial intelligence-based contents,” Available at SSRN 4332664,
tional Conference on Availability, Reliability and Security, 2019. 2023.
[105] A. M. Berg, S. T. Mol, G. Kismihók, and N. Sclater, “The role of a [128] T. Yue, D. Au, C. C. Au, and K. Y. Iu, “Democratizing financial knowl-
reference synthetic data generator within the field of learning analytics.” edge with chatgpt by openai: Unleashing the power of technology,”
Journal of Learning Analytics, vol. 3, no. 1, pp. 107–128, 2016. Available at SSRN 4346152, 2023.
[106] B. Howe, J. Stoyanovich, H. Ping, B. Herman, and M. Gee, “Synthetic [129] V. Cheng, V. M. Suriyakumar, N. Dullerud, S. Joshi, and M. Ghas-
data for social good,” arXiv preprint arXiv:1710.08874, 2017. semi, “Can you fake it until you make it? impacts of differentially
[107] P. Bautista and P. S. Inventado, “Protecting student privacy with private synthetic data on downstream classification fairness,” in ACM
synthetic data from generative adversarial networks,” in International Conference on Fairness, Accountability, and Transparency, 2021.
Conference on Artificial Intelligence in Education. Springer, 2021. [130] J. Hurst, K. Mayorov, and J. F. T. Tatsinkou, “The generation of
[108] H. Jiang, J. Li, P. Zhao, F. Zeng, Z. Xiao, and A. Iyengar, “Location synthetic data for risk modelling,” Journal of Risk Management in
privacy-preserving mechanisms in location-based services: A compre- Financial Institutions, vol. 15, no. 3, pp. 260–269, 2022.
hensive survey,” ACM Computing Surveys (CSUR), vol. 54, no. 1, pp. [131] Y.-L. Peng and W.-P. Lee, “Data selection to avoid overfitting for
1–36, 2021. foreign exchange intraday trading with machine learning,” Applied Soft
[109] R. Kato, M. Iwata, T. Hara, A. Suzuki, X. Xie, Y. Arase, and S. Nishio, Computing, vol. 108, p. 107461, 2021.
“A dummy-based anonymization method based on user trajectory [132] M. J. Schneider, S. Jagpal, S. Gupta, S. Li, and Y. Yu, “A flexible
with pauses,” in International Conference on Advances in Geographic method for protecting marketing data: An application to point-of-sale
Information Systems, 2012. data,” Marketing Science, vol. 37, no. 1, pp. 153–171, 2018.
[110] Y. Du, S. Wang, X. Guo, H. Cao, S. Hu, J. Jiang, A. Varala, [133] D. M. Smith, G. P. Clarke, and K. Harland, “Improving the synthetic
A. Angirekula, and L. Zhao, “GraphGT: Machine learning datasets data generation process in spatial microsimulation models,” Environ-
for graph generation and transformation,” in Neural Information Pro- ment and Planning A, vol. 41, no. 5, pp. 1251–1268, 2009.
cessing Systems Datasets and Benchmarks Track, 2021. [134] G. Papyshev and M. Yarime, “Exploring city digital twins as policy
tools: A task-based approach to generating synthetic data on urban
[111] R. Chen, G. Acs, and C. Castelluccia, “Differentially private sequential
mobility,” Data & Policy, vol. 3, p. e16, 2021.
data publication via variable-length n-grams,” in ACM conference on
[135] Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond,
Computer and communications security, 2012.
T. Eccles, J. Keeling, F. Gimeno, A. Dal Lago et al., “Competition-
[112] C. Dwork, A. Roth et al., “The algorithmic foundations of differential
level code generation with alphacode,” Science, vol. 378, no. 6624, pp.
privacy,” Foundations and Trends® in Theoretical Computer Science,
1092–1097, 2022.
vol. 9, no. 3–4, pp. 211–407, 2014.
[136] T. Ye, Y. Du, T. Ma, L. Wu, X. Zhang, S. Ji, and W. Wang, “Uncovering
[113] T. Cunningham, G. Cormode, H. Ferhatosmanoglu, and D. Srivastava,
llm-generated code: A zero-shot synthetic code detector via code
“Real-world trajectory sharing with local differential privacy,” Proceed-
rewriting,” arXiv preprint arXiv:2405.16133, 2024.
ings of the VLDB Endowment, vol. 14, no. 11, pp. 2283–2295, 2021.
[137] V. Seib, B. Lange, and S. Wirtz, “Mixing real and synthetic data to
[114] Y. Du, Y. Hu, Z. Zhang, Z. Fang, L. Chen, B. Zheng, and Y. Gao, “Ldp- enhance neural network training–a review of current approaches,” arXiv
trace: Locally differentially private trajectory synthesis,” Proceedings preprint arXiv:2007.08781, 2020.
of the VLDB Endowment, vol. 16, no. 8, pp. 1897–1909, 2023. [138] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT
[115] X. He, G. Cormode, A. Machanavajjhala, C. M. Procopiuc, and Press, 2016.
D. Srivastava, “Dpt: differentially private trajectory synthesis using [139] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural net-
hierarchical reference systems,” VLDB Endowment, vol. 8, no. 11, pp. works,” IEEE transactions on Signal Processing, vol. 45, no. 11, pp.
1154–1165, 2015. 2673–2681, 1997.
[116] S. Wang and R. O. Sinnott, “Protecting personal trajectories of so- [140] S. Hochreiter and J. Schmidhuber, “Lstm can solve hard long time lag
cial media users through differential privacy,” Computers & Security, problems,” Advances in neural information processing systems, vol. 9,
vol. 67, pp. 142–163, 2017. 1996.
[117] M. E. Gursoy, L. Liu, S. Truex, L. Yu, and W. Wei, “Utility-aware [141] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares,
synthesis of differentially private and attack-resilient location traces,” H. Schwenk, and Y. Bengio, “Learning phrase representations using
in ACM SIGSAC conference on computer and communications security, RNN Encoder–Decoder for statistical machine translation,” in Confer-
2018. ence on Empirical Methods in Natural Language Processing (EMNLP),
[118] M. E. Gursoy, L. Liu, S. Truex, and L. Yu, “Differentially private and 2014.
utility preserving publication of trajectory data,” IEEE Transactions on [142] T. N. Kipf and M. Welling, “Semi-supervised classification with
Mobile Computing, vol. 18, no. 10, pp. 2315–2329, 2018. graph convolutional networks,” International Conference on Learning
[119] D. J. Mir, S. Isaacman, R. Cáceres, M. Martonosi, and R. N. Wright, Representations (ICLR), 2016.
“Dp-where: Differentially private modeling of human mobility,” in [143] Y. Qian, Y. Zhang, Y. Ye, C. Zhang et al., “Distilling meta knowledge
IEEE international conference on big data. IEEE, 2013. on heterogeneous graph for illicit drug trafficker detection on social
[120] H. Roy, M. Kantarcioglu, and L. Sweeney, “Practical differentially media,” in Advances in Neural Information Processing Systems, 2021.
private modeling of human movement data,” in Annual IFIP WG 11.3 [144] W. Fan, Y. Ma, Q. Li, Y. He, E. Zhao, J. Tang, and D. Yin, “Graph
Working Conference on Data and Applications Security and Privacy. neural networks for social recommendation,” in The world wide web
Springer, 2016. conference, 2019.
[121] V. Bindschaedler and R. Shokri, “Synthesizing plausible privacy- [145] Y. Ma, S. Wang, C. C. Aggarwal, and J. Tang, “Graph convolutional
preserving location traces,” in IEEE Symposium on Security and networks with eigenpooling,” in ACM SIGKDD international confer-
Privacy (SP), 2016. ence on knowledge discovery & data mining, 2019, pp. 723–731.
[122] H. Wang, Z. Zhang, T. Wang, S. He, M. Backes, J. Chen, and [146] Y. Qian, Y. Zhang, Q. Wen, Y. Ye, and C. Zhang, “Rep2vec: Repository
Y. Zhang, “Privtrace: Differentially private trajectory synthesis by embedding via heterogeneous graph adversarial contrastive learning,”
adaptive markov model,” in USENIX Security Symposium 2023, 2023. in ACM SIGKDD Conference on Knowledge Discovery and Data
[123] J. Narita, T. Murakami, H. Hino, M. Nishigaki, and T. Ohki, “Syn- Mining, 2022.
thesizing differentially private location traces including co-locations,” [147] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan et al.,
International Journal of Information Security, vol. 23, no. 1, pp. 389– “Language models are few-shot learners,” in Advances in Neural
410, 2024. Information Processing Systems, 2020.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18
[148] S. Ghosh, M. Chollet, E. Laksana, L.-P. Morency, and S. Scherer, [173] J. Geiping, H. Bauermeister, H. Dröge, and M. Moeller, “Inverting
“Affect-LM: A neural language model for customizable affective text gradients-how easy is it to break privacy in federated learning?”
generation,” in Annual Meeting of the Association for Computational Advances in Neural Information Processing Systems, 2020.
Linguistics, 2017. [174] J. Hayes, L. Melis, G. Danezis, and E. De Cristofaro, “Logan: Mem-
[149] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” bership inference attacks against generative models,” Proceedings on
International Conference on Learning Representations (ICLR), 2014. Privacy Enhancing Technologies, 2019.
[150] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, [175] B. Hitaj, G. Ateniese, and F. Perez-Cruz, “Deep models under the
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” gan: information leakage from collaborative deep learning,” in ACM
in Advances in neural information processing systems, 2014. SIGSAC conference on computer and communications security, 2017.
[151] I. J. Goodfellow, “On distinguishability criteria for estimating genera- [176] Z. Wang, M. Song, Z. Zhang, Y. Song, Q. Wang, and H. Qi, “Be-
tive models,” arXiv preprint arXiv:1412.6515, 2014. yond inferring class representatives: User-level privacy leakage from
[152] Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, federated learning,” in IEEE conference on computer communications,
and B. Poole, “Score-based generative modeling through stochastic 2019.
differential equations,” arXiv preprint arXiv:2011.13456, 2020. [177] G. Ganev and E. De Cristofaro, “On the inadequacy of similarity-
[153] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. based privacy metrics: Reconstruction attacks against” truly anonymous
MIT press, 2018. synthetic data”,” arXiv preprint arXiv:2312.05114, 2023.
[154] S. H. Lee, “Natural language generation for electronic health records,” [178] B. Hilprecht, M. Härterich, and D. Bernau, “Monte carlo and re-
NPJ digital medicine, vol. 1, no. 1, pp. 1–7, 2018. construction membership inference attacks against generative models,”
[155] W. Hu, B. Liu, J. Gomes, M. Zitnik, P. Liang, V. Pande, and Proceedings on Privacy Enhancing Technologies, 2019.
J. Leskovec, “Strategies for pre-training graph neural networks,” in [179] Y. Xu, S. Mukherjee, X. Liu, S. Tople, R. M. Dodhia, and J. M. L. Fer-
International Conference on Learning Representations, 2019. res, “Mace: A flexible framework for membership privacy estimation
[156] C. P. Burgess, I. Higgins, A. Pal, L. Matthey, N. Watters, G. Des- in generative models,” Transactions on Machine Learning Research,
jardins, and A. Lerchner, “Understanding disentangling in β-vae,” 2022.
arXiv preprint arXiv:1804.03599, 2018. [180] T. Stadler, B. Oprisanu, and C. Troncoso, “Synthetic data–
[157] A. Vahdat and J. Kautz, “Nvae: A deep hierarchical variational autoen- anonymisation groundhog day,” in USENIX Security Symposium, 2022.
coder,” Advances in neural information processing systems, vol. 33, [181] D. Chen, N. Yu, Y. Zhang, and M. Fritz, “Gan-leaks: A taxonomy
2020. of membership inference attacks against generative models,” in ACM
[158] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, SIGSAC conference on computer and communications security, 2020.
S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial net- [182] V. Bindschaedler, R. Shokri, and C. A. Gunter, “Plausible deniability
works,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, for privacy-preserving data synthesis,” VLDB Endowment, vol. 10,
2020. no. 5, 2017.
[159] Y. Zhang, Y. Qian, Y. Fan, Y. Ye, X. Li, Q. Xiong, and F. Shao, “dstyle- [183] B.-W. Tseng and P.-Y. Wu, “Compressive privacy generative adversarial
gan: Generative adversarial network based on writing and photography network,” IEEE Transactions on Information Forensics and Security,
styles for drug identification in darknet markets,” in Annual Computer vol. 15, pp. 2499–2513, 2020.
Security Applications Conference, 2020. [184] X. Zhang, S. Ji, and T. Wang, “Differentially private releasing via deep
[160] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative generative model (technical report),” arXiv preprint arXiv:1801.01594,
adversarial networks,” in International conference on machine learning. 2018.
PMLR, 2017. [185] L. Xie, K. Lin, S. Wang, F. Wang, and J. Zhou, “Differentially pri-
[161] Y. Zhang, Y. Qian, Y. Ye, and C. Zhang, “Adapting distilled knowledge vate generative adversarial network,” arXiv preprint arXiv:1802.06739,
for few-shot relation reasoning over knowledge graphs,” in SIAM 2018.
International Conference on Data Mining (SDM), 2022. [186] C. Xu, J. Ren, D. Zhang, Y. Zhang, Z. Qin, and K. Ren, “Ganobfusca-
[162] J. Sohl-Dickstein, E. Weiss, N. Maheswaranathan, and S. Ganguli, tor: Mitigating information leakage under gan via differential privacy,”
“Deep unsupervised learning using nonequilibrium thermodynamics,” IEEE Transactions on Information Forensics and Security, vol. 14,
in International Conference on Machine Learning. PMLR, 2015. no. 9, pp. 2358–2371, 2019.
[163] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic [187] A. Triastcyn and B. Faltings, “Federated generative privacy,” IEEE
models,” in Advances in Neural Information Processing Systems, 2020. Intelligent Systems, vol. 35, no. 4, pp. 50–57, 2020.
[164] M. Liu, K. Yan, B. Oztekin, and S. Ji, “Graphebm: Molecu- [188] P.-H. Lu and C.-M. Yu, “Poster: A unified framework of differentially
lar graph generation with energy-based models,” arXiv preprint private synthetic data release with generative adversarial network,” in
arXiv:2102.00546, 2021. ACM SIGSAC Conference on Computer and Communications Security,
[165] L. Weng, “What are diffusion models?” lilianweng.github.io/lil-log, 2017.
2021. [Online]. Available: https://lilianweng.github.io/lil-log/2021/07/ [189] Y. Liu, J. Peng, J. James, and Y. Wu, “Ppgan: Privacy-preserving
11/diffusion-models.html generative adversarial network,” in IEEE international conference on
[166] G. Somepalli, V. Singla, M. Goldblum, J. Geiping, and T. Goldstein, parallel and distributed systems (ICPADS). IEEE, 2019.
“Diffusion art or digital forgery? investigating data replication in [190] L. Frigerio, A. S. de Oliveira, L. Gomez, and P. Duverger, “Dif-
diffusion models,” arXiv preprint arXiv:2212.03860, 2022. ferentially private generative adversarial networks for time series,
[167] K. Huang, T. Fu, W. Gao, Y. Zhao, Y. Roohani, J. Leskovec, C. W. continuous, and discrete open data,” in IFIP TC 11 International
Coley, C. Xiao, J. Sun, and M. Zitnik, “Therapeutics data commons: Conference on ICT Systems Security and Privacy Protection. Springer,
machine learning datasets and tasks for therapeutics,” NeurIPS Track 2019, pp. 151–164.
Datasets and Benchmarks, 2021. [191] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, R. Lee, S. P. Bhavnani,
[168] R. Shokri, M. Stronati, C. Song, and V. Shmatikov, “Membership in- J. B. Byrd, and C. S. Greene, “Privacy-preserving generative deep neu-
ference attacks against machine learning models,” in IEEE symposium ral networks support clinical data sharing,” Circulation: Cardiovascular
on security and privacy (SP), 2017. Quality and Outcomes, vol. 12, no. 7, p. e005122, 2019.
[169] S. Truex, L. Liu, M. E. Gursoy, L. Yu, and W. Wei, “Demystifying [192] G. Astolfi and K. David, “Generating tabular data using generative ad-
membership inference attacks in machine learning as a service,” IEEE versarial networks with differential privacy,” in Conference of European
Transactions on Services Computing, vol. 14, no. 6, pp. 2073–2089, Statisticians, 2021.
2019. [193] D. Chen, T. Orekondy, and M. Fritz, “Gs-wgan: A gradient-sanitized
[170] L. Melis, C. Song, E. De Cristofaro, and V. Shmatikov, “Exploiting un- approach for learning differentially private generators,” Advances in
intended feature leakage in collaborative learning,” in IEEE symposium Neural Information Processing Systems, vol. 33, 2020.
on security and privacy (SP), 2019. [194] T. Cao, A. Bie, A. Vahdat, S. Fidler, and K. Kreis, “Don’t generate
[171] L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” Advances me: Training differentially private generative models with sinkhorn
in neural information processing systems, 2019. divergence,” Advances in Neural Information Processing Systems,
[172] W. Wei, L. Liu, M. Loper, K.-H. Chow, M. E. Gursoy, S. Truex, and vol. 34, 2021.
Y. Wu, “A framework for evaluating client privacy leakages in federated [195] A. Torfi, E. A. Fox, and C. K. Reddy, “Differentially private synthetic
learning,” in European Symposium on Research in Computer Security. medical data generation using convolutional gans,” Information Sci-
Springer, 2020. ences, vol. 586, pp. 485–500, 2022.
JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 19
[196] L. Fan and A. Pokkunuru, “Dpnet: Differentially private network traffic synthesis with generative adversarial networks,” in IFIP Annual Conference on Data and Applications Security and Privacy. Springer, 2021.
[197] Y. Long, B. Wang, Z. Yang, B. Kailkhura, A. Zhang, C. Gunter, and B. Li, “G-pate: Scalable differentially private data generator via private aggregation of teacher discriminators,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[198] T. Cunningham, K. Klemmer, H. Wen, and H. Ferhatosmanoglu, “Geopointgan: Synthetic spatial data with local label differential privacy,” arXiv preprint arXiv:2205.08886, 2022.
[199] M. Vinaroz, M.-A. Charusaie, F. Harder, K. Adamczewski, and M. J. Park, “Hermite polynomial features for private data generation,” in International Conference on Machine Learning. PMLR, 2022.
[200] F. Harder, K. Adamczewski, and M. Park, “Dp-merf: Differentially private mean embeddings with random features for practical privacy-preserving data generation,” in International conference on artificial intelligence and statistics. PMLR, 2021.
[201] R. Chen, Q. Xiao, Y. Zhang, and J. Xu, “Differentially private high-dimensional data publication via sampling-based inference,” in ACM SIGKDD international conference on knowledge discovery and data mining, 2015.
[202] K. Cai, X. Lei, J. Wei, and X. Xiao, “Data synthesis via differentially private markov random fields,” VLDB Endowment, vol. 14, no. 11, pp. 2190–2202, 2021.
[203] R. McKenna, B. Mullins, D. Sheldon, and G. Miklau, “Aim: An adaptive and iterative mechanism for differentially private synthetic data,” arXiv preprint arXiv:2201.12677, 2022.
[204] C.-H. Lin, C.-M. Yu, and C.-Y. Huang, “Dpview: Differentially private data synthesis through domain size information,” IEEE Internet of Things Journal, vol. 9, no. 17, pp. 15886–15900, 2022.
[205] V. Chandrasekaran, D. Edge, S. Jha, A. Sharma, C. Zhang, and S. Tople, “Causally constrained data synthesis for private data release,” arXiv preprint arXiv:2105.13144, 2021.
[206] J. Zhang, G. Cormode, C. M. Procopiuc, D. Srivastava, and X. Xiao, “Privbayes: Private data release via bayesian networks,” ACM Transactions on Database Systems (TODS), vol. 42, no. 4, pp. 1–41, 2017.
[207] C. Ge, S. Mohapatra, X. He, and I. F. Ilyas, “Kamino: Constraint-aware differentially private data synthesis,” arXiv preprint arXiv:2012.15713, 2020.
[208] M. Gaboardi, E. J. G. Arias, J. Hsu, A. Roth, and Z. S. Wu, “Dual query: Practical private query release for high dimensional data,” in International Conference on Machine Learning, 2014.
[209] M. Hardt, K. Ligett, and F. McSherry, “A simple and practical algorithm for differentially private data release,” Advances in neural information processing systems, 2012.
[210] Z. Zhang, T. Wang, N. Li, J. Honorio, M. Backes, S. He, J. Chen, and Y. Zhang, “PrivSyn: Differentially private data synthesis,” in USENIX Security Symposium, 2021.
[211] Q. Chen, C. Xiang, M. Xue, B. Li, N. Borisov, D. Kaarfar, and H. Zhu, “Differentially private data generative models,” arXiv preprint arXiv:1812.02274, 2018.
[212] E. Bao, X. Xiao, J. Zhao, D. Zhang, and B. Ding, “Synthetic data generation with differential privacy via bayesian networks,” Journal of Privacy and Confidentiality, 2021.
[213] F. Liu, Z. Cheng, H. Chen, Y. Wei, L. Nie, and M. Kankanhalli, “Privacy-preserving synthetic data generation for recommendation systems,” in ACM SIGIR Conference on Research and Development in Information Retrieval, 2022.
[214] J.-W. Chen, C.-M. Yu, C.-C. Kao, T.-W. Pang, and C.-S. Lu, “Dpgen: Differentially private generative energy-guided network for natural image synthesis,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[215] Z. Zhang, T. Wang, N. Li, S. He, and J. Chen, “Calm: Consistent adaptive local marginal for marginal release under local differential privacy,” in ACM SIGSAC Conference on Computer and Communications Security, 2018.
[216] W. Qardaji, W. Yang, and N. Li, “Priview: practical differentially private release of marginal contingency tables,” in ACM SIGMOD international conference on Management of data, 2014.
[217] L. Sweeney, “k-anonymity: A model for protecting privacy,” International journal of uncertainty, fuzziness and knowledge-based systems, vol. 10, no. 05, pp. 557–570, 2002.
[218] P. Samarati and L. Sweeney, “Generalizing data to provide anonymity when disclosing information,” in ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, 1998.
[219] P. Samarati, “Protecting respondents identities in microdata release,” IEEE transactions on Knowledge and Data Engineering, vol. 13, no. 6, pp. 1010–1027, 2001.
[220] B. Barak, K. Chaudhuri, C. Dwork, S. Kale, F. McSherry, and K. Talwar, “Privacy, accuracy, and consistency too: a holistic solution to contingency table release,” in ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, 2007.
[221] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, “Local privacy and statistical minimax rates,” in IEEE Annual Symposium on Foundations of Computer Science, 2013.
[222] S.-Y. Kung, “Compressive privacy: From information/estimation theory to machine learning,” IEEE Signal Processing Magazine, vol. 34, no. 1, pp. 94–112, 2017.
[223] H. Li, L. Xiong, and X. Jiang, “Differentially private synthesization of multi-dimensional data using copula functions,” in International Conference on Extending Database Technology. NIH Public Access, 2014.
[224] N. Patki, R. Wedge, and K. Veeramachaneni, “The synthetic data vault,” in IEEE international conference on data science and advanced analytics (DSAA). IEEE, 2016.
[225] S. Gambs, F. Ladouceur, A. Laurent, and A. Roy-Gaumond, “Growing synthetic data through differentially-private vine copulas,” Proceedings on Privacy Enhancing Technologies, 2021.
[226] H. J. Asghar, M. Ding, T. Rakotoarivelo, S. Mrabet, and M. A. Kaafar, “Differentially private release of high-dimensional datasets using the gaussian copula,” arXiv preprint arXiv:1902.01499, 2019.
[227] T. Rekatsinas, X. Chu, I. F. Ilyas, and C. Ré, “Holoclean: Holistic data repairs with probabilistic inference,” arXiv preprint arXiv:1702.00820, 2017.
[228] H. Wang, S. Sudalairaj, J. Henning, K. Greenewald, and A. Srivastava, “Post-processing private synthetic data for improving utility on selected measures,” Advances in Neural Information Processing Systems, 2024.
[229] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. B. Brown, D. Song, U. Erlingsson et al., “Extracting training data from large language models,” in USENIX Security Symposium, 2021.
[230] N. Carlini, J. Hayes, M. Nasr, M. Jagielski, V. Sehwag, F. Tramer, B. Balle, D. Ippolito, and E. Wallace, “Extracting training data from diffusion models,” arXiv preprint arXiv:2301.13188, 2023.
[231] C. Meehan, K. Chaudhuri, and S. Dasgupta, “A non-parametric test to detect data-copying in generative models,” in International Conference on Artificial Intelligence and Statistics, 2020.
[232] Q. Feng, C. Guo, F. Benitez-Quiroz, and A. M. Martinez, “When do gans replicate? on the choice of dataset size,” in IEEE/CVF International Conference on Computer Vision, 2021.
[233] N. Kandpal, E. Wallace, and C. Raffel, “Deduplicating training data mitigates privacy risks in language models,” in International Conference on Machine Learning. PMLR, 2022.
[234] T. Dockhorn, T. Cao, A. Vahdat, and K. Kreis, “Differentially private diffusion models,” arXiv preprint arXiv:2210.09929, 2022.
[235] Y.-L. Tsai, Y. Li, Z. Chen, P.-Y. Chen, C.-M. Yu, X. Ren, and F. Buet-Golfouse, “Differentially private fine-tuning of diffusion models,” arXiv preprint arXiv:2406.01355, 2024.
[236] H. Wang, S. Pang, Z. Lu, Y. Rao, Y. Zhou, and M. Xue, “dp-promise: Differentially private diffusion probabilistic models for image synthesis,” in USENIX Security Symposium, 2024.
[237] S. Lyu, M. F. Liu, M. Vinaroz, and M. Park, “Differentially private latent diffusion models,” arXiv preprint arXiv:2305.15759, 2023.
[238] J. Lebensold, M. Sanjabi, P. Astolfi, A. Romero-Soriano, K. Chaudhuri, M. Rabbat, and C. Guo, “Dp-rdm: Adapting diffusion models to private domains without fine-tuning,” arXiv preprint arXiv:2403.14421, 2024.
[239] D. Yu, S. Naik, A. Backurs, S. Gopi, H. A. Inan, G. Kamath, J. Kulkarni, Y. T. Lee, A. Manoel, L. Wutschitz et al., “Differentially private fine-tuning of language models,” in International Conference on Learning Representations, 2021.
[240] W. Wei and L. Liu, “Trustworthy distributed ai systems: Robustness, privacy, and governance,” ACM Computing Surveys, 2024.
[241] B. Oprisanu, G. Ganev, and E. De Cristofaro, “Measuring utility and privacy of synthetic genomic data,” arXiv preprint arXiv:2102.03314, 2021.
[242] M. Pereira, M. Kshirsagar, S. Mukherjee, R. Dodhia, and J. Ferres, “An analysis of the deployment of models trained on private tabular synthetic data: Unexpected surprises,” arXiv preprint arXiv:2106.10241.
[243] G. Ganev, B. Oprisanu, and E. De Cristofaro, “Robin hood and matthew effects: Differential privacy has disparate impact on synthetic data,” in International Conference on Machine Learning. PMLR, 2022.
[244] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” ACM Computing Surveys (CSUR), vol. 54, no. 6, pp. 1–35, 2021.
[245] A. Abusitta, E. Aïmeur, and O. A. Wahab, “Generative adversarial networks for mitigating biases in machine learning systems,” arXiv preprint arXiv:1905.09972, 2019.
[246] F. H. K. d. S. Tanaka and C. Aranha, “Data augmentation using gans,” arXiv preprint arXiv:1904.09135, 2019.
[247] G. Mariani, F. Scheidegger, R. Istrate, C. Bekas, and C. Malossi, “Bagan: Data augmentation with balancing gan,” arXiv preprint arXiv:1803.09655, 2018.
[248] E. Barbierato, M. L. D. Vedova, D. Tessera, D. Toti, and N. Vanoli, “A methodology for controlling bias and fairness in synthetic data generation,” Applied Sciences, vol. 12, no. 9, p. 4619, 2022.
[249] D. Xu, S. Yuan, L. Zhang, and X. Wu, “Fairgan: Fairness-aware generative adversarial networks,” in IEEE International Conference on Big Data, 2018.
[250] P. Sattigeri, S. C. Hoffman, V. Chenthamarakshan, and K. R. Varshney, “Fairness gan: Generating datasets with fairness properties using a generative adversarial network,” IBM Journal of Research and Development, vol. 63, no. 4/5, pp. 3–1, 2019.
[251] B. Van Breugel, T. Kyono, J. Berrevoets, and M. Van der Schaar, “Decaf: Generating fair synthetic data using causally-aware generative networks,” Advances in Neural Information Processing Systems, vol. 34, 2021.
[252] E. Bagdasaryan, O. Poursaeed, and V. Shmatikov, “Differential privacy has disparate impact on model accuracy,” Advances in neural information processing systems, vol. 32, 2019.
[253] L. Weidinger, J. Mellor, M. Rauh, C. Griffin, J. Uesato, P.-S. Huang, M. Cheng, M. Glaese, B. Balle, A. Kasirzadeh et al., “Ethical and social risks of harm from language models,” arXiv preprint arXiv:2112.04359, 2021.
[254] G. K. Anumanchipalli, J. Chartier, and E. F. Chang, “Speech synthesis from neural decoding of spoken sentences,” Nature, vol. 568, no. 7753, pp. 493–498, 2019.
[255] C. Yan, Y. Yan, Z. Wan, Z. Zhang, L. Omberg, J. Guinney, S. D. Mooney, and B. A. Malin, “A multifaceted benchmarking of synthetic electronic health record generation models,” Nature Communications, vol. 13, no. 1, pp. 1–18, 2022.
[256] C. Esteban, S. L. Hyland, and G. Rätsch, “Real-valued (medical) time series generation with recurrent conditional gans,” arXiv preprint arXiv:1706.02633, 2017.
[257] Z. Zhao, F. Zhou, Z. Zeng, C. Guan, and S. K. Zhou, “Meta-hallucinator: Towards few-shot cross-modality cardiac image segmentation,” in Medical Image Computing and Computer Assisted Intervention (MICCAI). Springer, 2022.
[258] H. Huang, K. Mehrotra, and C. K. Mohan, “Rank-based outlier detection,” Journal of Statistical Computation and Simulation, vol. 83, no. 3, pp. 518–531, 2013.

Yingzhou Lu is a postdoctoral researcher at Stanford University. She obtained her Ph.D. in Artificial Intelligence and Computational Biology from Virginia Tech. With eight years of expertise in machine learning and genomics, she has cultivated a deep understanding of genomics data analysis, graph theory, and multi-omics data integration. Her research focuses on employing advanced computational techniques to analyze genomics datasets, aiming to shed light on disease development and progression.

Minjie Shen served as a Graduate Research Assistant at the Computational Bioinformatics and Bioimaging Laboratory at Virginia Tech. Her research focuses on machine learning algorithms for biomedical signal processing, particularly in the areas of missing data imputation, normalization, and deconvolution. She holds an M.S. degree in Computer Engineering from Virginia Tech, a B.S. degree in Computer Science from Colorado State University, and a B.E. degree in Electrical Engineering from East China Normal University.

Huazheng Wang is an Assistant Professor of Computer Science at Oregon State University. He obtained his Ph.D. in Computer Science from the University of Virginia and his B.E. in Computer Science from the University of Science and Technology of China. His research focuses on the theory and applications of machine learning, reinforcement learning, and information retrieval. He is a recipient of the SIGIR 2019 Best Paper Award.

Xiao Wang is a postdoc at the University of Washington, Seattle, USA. He received his PhD degree from Purdue University, West Lafayette, IN, USA in 2023 and his B.S. degree from the Department of Computer Science at Xi’an Jiaotong University, Xi’an, China in 2018. He is an associate editor of IEEE Transactions on Intelligent Vehicles. He has published numerous first-author papers in a broad range of venues, including Nature Methods, Nature Communications, and IEEE-TPAMI. His research interests include deep learning, computer vision, bioinformatics, and intelligent systems.

Capucine Van Rechem is an Assistant Professor of Pathology at the Stanford University School of Medicine. She focuses on the molecular impact of chromatin modifiers on disease development, with an emphasis on cancer. Her laboratory takes a cell-cycle-specific angle to explore functions such as gene expression and replication timing. They also explore unconventional direct roles for these factors in the cytoplasm, with a focus on protein synthesis. Their ultimate goal is to provide needed insights into new targeted therapies.

Tianfan Fu is a tenure-track assistant professor in the Computer Science Department at Rensselaer Polytechnic Institute (RPI). He received his B.S. and M.S. degrees from the Department of Computer Science and Engineering at Shanghai Jiao Tong University in 2015 and 2018, respectively. He received his Ph.D. degree from the Department of Computational Science and Engineering at the Georgia Institute of Technology in 2023. His research interest lies in machine learning for drug discovery. In particular, he is interested in generative models for both small-molecule and macromolecular drug design and in deep representation learning for drug development. The results of his research have been published in leading AI conferences, including NeurIPS, ICLR, KDD, AISTATS, UAI, AAAI, and IJCAI, and in top domain journals such as Nature, Nature Chemical Biology, Cell Patterns, and Bioinformatics.

Wenqi Wei is a tenure-track assistant professor in the Computer and Information Sciences Department, Fordham University. He received his PhD from the School of Computer Science, Georgia Institute of Technology, and his B.E. degree from the School of Electronic Information and Communications, Huazhong University of Science and Technology. His research interests include trustworthy AI, data privacy, machine learning services, and big data analytics. His research has appeared in major cybersecurity, data mining, and AI venues, including CCS, CVPR, IJCAI, TheWebConf, IEEE TDSC, IEEE TIFS, and ACM CSUR. He is an associate editor of ACM Transactions on Internet Technology.