SD Guide
I. Introduction to Privacy Enhancing Technology
(PET)
Privacy Enhancing Technologies (PETs) are a suite of tools and techniques that allow
the processing, analysis, and extraction of insights from data without revealing the
underlying personal or commercially sensitive data. By incorporating PETs, companies
can maintain a competitive edge in the market through leveraging their existing data
assets for innovation while complying with data protection regulations, reducing the
risk of data breaches and demonstrating a commitment to data protection. PETs are
not just a defensive measure; they are a proactive step towards fostering a culture of
data protection and securing a company's reputation in the digital age.
PETs can generally be classified into three key categories1: data obfuscation, encrypted
data processing, and federated analytics. PETs can also be combined to address varying
needs of organisations. The following Table 1 maps out the current types of PETs in
the market and their key applications.
1
Adapted from OECD, “Emerging Privacy Enhancing Technologies: Current Regulatory and Policy
Approaches,” OECD Digital Economy Papers (OECD, 2023).
Multi-party computation (including private set intersection): computing on private data that is not disclosed.
The target audience for this guide comprises CIOs, CTOs, CDOs, data scientists, data protection practitioners, and technical decision-makers who may be directly or indirectly involved in the generation and use of synthetic data.
Synthetic data is a technology that is being actively researched and developed at the
time of publication. Hence, this guide is not intended to provide a comprehensive or
in-depth review of the technology or its assessment methods. The guide is intended
to be a living document, and will be updated to ensure its recommendations remain
relevant.
2
There are two types of synthetic data: fully synthetic data and partially synthetic data. This guide
discusses the use of fully synthetic data.
3
In this guide, we generally refer to privacy risks as re-identification risks.
What is Synthetic Data?
Synthetic data is commonly referred to as artificial data that has been generated using
a purpose-built mathematical model (including artificial intelligence (AI)/machine
learning (ML) models) or algorithm. It can be derived by training a model (or algorithm)
on a source dataset to mimic the characteristics and structure of the source data. Good
quality synthetic data can retain the statistical properties and patterns of the source
data to a high extent. As a result, performing analysis on synthetic data can produce
results similar to those yielded with source data.
Figure 1 shows an example of how synthetic data may look compared with the source data. Generated synthetic data will generally have different data points from
the source data, as seen from the tabular data. However, the synthetic data will have
statistical properties that are close to that of the source data, i.e., capturing the
distribution and structure of the source data as seen from the trend lines in Figure 1.
As such, synthetic data may not always be inherently risk-free as information about an
individual in the source dataset, or confidential data, can still be leaked due to the
resemblance of the synthetic data to the source data. There will also be trade-offs5
between data utility and data protection risks in synthetic data generation. However,
such risks can be minimised by taking data protection into consideration during the
synthetic data generation process.
4
Diagram taken with modification from Khaled El Emam, Lucy Mosquera, and Richard Hoptroff, Practical
Synthetic Data Generation (O’Reilly Media, Inc, 2020).
5
Trade-off between data utility and data protection risks is further discussed in Annex A: Step 1 and Step
3 in this guide.
Under What Circumstances is Synthetic Data Useful?
Synthetic data can be used in a variety of use cases ranging from generating training
datasets for AI models to data analysis and collaboration. The use of synthetic data can not only accelerate research, innovation, collaboration, and decision-making but also mitigate concerns about cybersecurity incidents and data breaches, enabling better compliance with data protection/privacy regulations. Table 2 discusses a few
common use case archetypes, their key benefits, and good practices that organisations
can focus on when generating synthetic data.
• Synthetic data can enable data sharing for analysis, especially in industries and sectors, e.g., healthcare, where the source data can be sensitive.

Previewing data for collaboration
• Synthetic data can be used in data exploration, analysis, and collaboration to provide stakeholders with a representative preview of the source data without exposing sensitive information.
• This enables stakeholders to explore and understand the structure, relationships, and potential insights within the data to gain assurance of the data quality before finalising any agreement or collaboration.

Good practices: apply data protection measures throughout the synthetic data generation process, for example:
Data preparation
• Remove outliers from source data
• Pseudonymise source data
• Employ data minimisation and generalise granular data
Synthetic data generation
• Add noise before or after synthetic data generation
Post synthetic data generation
• Incorporate technical, contractual, and governance measures to mitigate any residual re-identification risks

Use case archetype 3: Software testing

System development/software testing
• Organisations can use synthetic data instead of production data to facilitate software development.
• Use of synthetic data can help organisations avoid data breaches in the event of the development environment being compromised.

Good practice: Focus on generating synthetic data that follows the semantics, e.g., format, min/max values and categories, of the source data instead of the statistical characteristics and properties.
Case Studies
Solution: J.P. Morgan successfully used synthetic data for fraud detection
model training. AI models were provided with samples of normal and fraudulent
transactions to understand the tell-tale signs of suspicious transactions.
6
J. P. Morgan, “Synthetic Data for Real Insights,” Technology Blog, n.d., https://www.jpmorgan.com/
technology/technology-blog/synthetic-data-for-real-insights
7
Contributed by Mastercard
(C) Safeguarding patient data for data analysis8
Problem: Prior to utilising synthetic data, Johnson & Johnson (J&J) allowed
external researchers or consortia to access healthcare data for research
proposals validated by J&J. To safeguard patient privacy, the data was
transformed into anonymised healthcare data. However, feedback received
indicated that the overall usefulness of the anonymised data, which relied on
traditional anonymisation techniques, was not always satisfactory and did not
always meet the requirements of the researchers or consortia.
Benefit: This allowed the pharmaceutical company to preview the data and be
assured of the data quality prior to the high-value purchase and access to the
actual data.
8
Contributed by Johnson & Johnson (J&J)
9
Contributed by A*STAR
III. Recommendations
Synthetic data has the potential to drive the growth of AI/ML by enabling AI model
training while protecting the underlying personal data. It also addresses dataset-related challenges for AI model training, such as insufficient and biased data, by enabling the augmentation and increased diversity of training datasets.
In addition, synthetic data can be used to facilitate and support organisations’ data
analytics, collaboration and software development needs. An added benefit of using
synthetic data in place of production data to facilitate software development is that
data breaches can be avoided in the event the development environment is
compromised.
Annex A: Handbook on Key Considerations and
Best Practices in Synthetic Data Generation
In this handbook, we describe the key considerations and best practices for
organisations to reduce re-identification risks of synthetic tabular data through a five-
step approach.
For any other complex synthetic datasets that are unstructured, organisations are
advised to consider hiring synthetic data experts, data scientists or independent risk
assessors to assess and mitigate the risks of the generated synthetic data.
• Where relevant, organisations should also put in place proper contractual obligations on recipients of synthetic data to prevent re-identification attacks on the data.
With this knowledge, the management and data owner, with the help of relevant
stakeholders such as the data analytics team, should establish objectives prior to
synthetic data generation to determine an acceptable risk threshold10 of the generated
synthetic data and the expected utility of the data. This will help provide organisations
with the appropriate benchmarks to assess any trade-offs between data protection
risks and data utility.
When preparing the source data 13 for generating synthetic data, it is important to
consider the following:
• What are the key insights that need to be preserved in the synthetic data?
• Which are the necessary data attributes for the synthetic data to meet the
business objectives?
10
The re-identification risk threshold represents the level of re-identification risk that is acceptable for
a given synthetic dataset. There is currently no universally accepted numerical value for risk threshold.
For further details refer to Step 4 (Assess re-identification risks).
11
Organisations may refer to ISO27001 for more information on developing an enterprise risk
management framework.
12
An example of this is PDPC’s Guide to Data Protection Impact Assessments. A DPIA is applicable in the
case where personal data is involved. The DPIA may not be relevant in situations where the synthetic
data generation does not involve personal data processing.
13
This step assumes that the source data has been properly cleaned (such as fixing or removing
incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data) and is of acceptable quality
for the generation of synthetic data.
Understanding key insights to be preserved
To ensure that the synthetic data can meet the business objectives, organisations need
to understand and identify the trends, key statistical properties, and attribute relationships in the source data that need to be preserved for analysis, e.g., identify
relationships between demographic characteristics of population and their health
conditions.
Organisations should consider, at this point, whether outlier trends and insights are
necessary to be preserved for the business objectives. Key considerations could include
the following:
• If outliers are not necessary to meet the business objectives and the risk of re-
identification is high, organisations should consider removing the outliers. This
can be done prior to synthetic data generation or at subsequent stages of the
synthetic data generation.
Based on the key insights needed, organisations should apply data minimisation to
extract only the relevant data attributes from the source data. Thereafter, remove or
pseudonymise all direct identifiers14 from the extracted data.
14
Refer to PDPC’s Guide to Basic Anonymisation on how to identify direct identifiers in a dataset.
information into height and weight bands to reduce the possibility of height and
weight combinations being used to identify any outliers.
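As an illustration only, the sketch below shows how these preparation steps might look on a pandas DataFrame; the column names (nric, height_cm, weight_kg, diagnosis) and the salted-hash pseudonymisation are hypothetical choices for the example, not prescribed techniques.

```python
import hashlib
import pandas as pd

def prepare_source_data(df: pd.DataFrame, salt: str) -> pd.DataFrame:
    """Illustrative data preparation: minimisation, pseudonymisation, generalisation."""
    # Data minimisation: keep only the attributes needed for the business objective.
    kept = df[["nric", "age", "height_cm", "weight_kg", "diagnosis"]].copy()

    # Pseudonymise the direct identifier with a salted hash (keep the salt secret).
    kept["person_id"] = kept["nric"].apply(
        lambda v: hashlib.sha256((salt + str(v)).encode()).hexdigest()[:16]
    )
    kept = kept.drop(columns=["nric"])

    # Generalise granular values into bands to reduce identifiability of outliers.
    kept["height_band"] = pd.cut(kept["height_cm"], bins=range(140, 211, 10))
    kept["weight_band"] = pd.cut(kept["weight_kg"], bins=range(40, 151, 10))
    return kept.drop(columns=["height_cm", "weight_kg"])
```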
Organisations should also standardise and document the details on each data attribute
(such as data definitions, standards, metrics etc.) in a data dictionary. This enables the
organisation to subsequently validate the integrity of the generated synthetic data to
detect anomalies and fix any data inconsistencies. Refer to the following checklist in
Table 3 for key considerations (see also the illustrative data dictionary sketch below).
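For illustration, a lightweight data dictionary can also be captured in code and reused later (see the data integrity check in Step 3) to validate generated synthetic data; the attribute names and specifications below are hypothetical.

```python
import pandas as pd

# Hypothetical data dictionary: expected type, allowed range or categories per attribute.
DATA_DICTIONARY = {
    "age":       {"type": "numeric", "range": (0, 100)},
    "gender":    {"type": "categorical", "categories": {"M", "F"}},
    "diagnosis": {"type": "categorical", "categories": {"DIABETES", "HYPERTENSION", "NONE"}},
}

def validate_against_dictionary(df: pd.DataFrame, dictionary: dict) -> list:
    """Return a list of integrity issues found in a (synthetic) dataset."""
    issues = []
    for col, spec in dictionary.items():
        if col not in df.columns:
            issues.append(f"missing column: {col}")
            continue
        if spec["type"] == "numeric":
            lo, hi = spec["range"]
            if not df[col].between(lo, hi).all():
                issues.append(f"{col}: values outside [{lo}, {hi}]")
        elif spec["type"] == "categorical":
            unexpected = set(df[col].dropna().unique()) - spec["categories"]
            if unexpected:
                issues.append(f"{col}: unexpected categories {unexpected}")
    return issues
```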
15
The use of differential privacy to add noise to synthetic data is widely discussed as a mechanism to
reduce re-identification risks. However, there is currently no universal standard on how to implement
differential privacy. Moreover, the noise added may also reduce the utility of the synthetic data, making
it less accurate or useful for certain types of analysis.
Step 3: Generate synthetic data
There are many different methods 16 to generate synthetic data, for example,
sequential tree-based synthesisers, copulas, and deep generative models (DGMs).
Organisations need to consider which methods are most appropriate, based on their
use cases, data objectives, and types of data. Please refer to Annex C for more
information on these synthetic data generation methods. Thereafter, organisations
may consider splitting the source data into two separate sets e.g., 80% as training
dataset, and 20% as control dataset 17 for assessing re-identification risks of the
synthetic data.
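A minimal sketch of such a split, assuming the cleaned source data is held in a pandas DataFrame named source_df (the 80/20 ratio simply follows the example above; a fixed seed keeps the split reproducible):

```python
from sklearn.model_selection import train_test_split

# 80% of the source data trains the synthetic data generator;
# 20% is held out as a control dataset for the re-identification risk assessment (Step 4).
train_df, control_df = train_test_split(source_df, test_size=0.2, random_state=42)
```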
After generating synthetic data, it is a good practice for organisations to perform the
following checks on the quality of the generated synthetic data:
• Data integrity
• Data fidelity
• Data utility
Data integrity
Data integrity ensures the accuracy, completeness, consistency, and validity of the
synthetic data as compared with the source data. Organisations can validate the
integrity of the generated synthetic data against the data dictionary of the source data.
Data fidelity
Data fidelity examines if synthetic data closely follows the characteristics and statistical
attributes of the source data. There are a few metrics for measuring data fidelity and
they are typically done by statistically comparing the generated synthetic data directly
with the source data. Organisations should use the performance metric(s) for data
fidelity18 (see Table 4) that best meet their data objectives.
16
This guide may not be comprehensive in covering all other synthetic data generation methods such
as Bayesian model and variational autoencoders (VAE).
17
Refer to Approach 2 in Annex E for more details on the assessment and evaluation framework for
quantifying re-identification risk.
18
There are other generic metrics described here in addition to those listed in Table 4. See Khaled El
Emam et al., “Utility Metrics for Evaluating Synthetic Health Data Generation Methods: Validation Study,”
JMIR Medical Informatics 10, no. 4 (2022)
Table 4: Performance metrics for data fidelity

Performance metrics generally used for assessing data fidelity:

Histogram-based similarity: Measures the similarity between the source and synthetic data's distributions through a histogram comparison of each feature. This ensures the synthetic data preserves important statistical properties such as central tendency (mean, median), dispersion (variance, range), and distribution shape (skewness, kurtosis).

Correlational similarity: Measures the preservation of relationships between features in the source and synthetic datasets. For example, if higher education typically leads to higher income in the source data, this pattern should also be evident in the synthetic data.
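As an illustration only, the two metrics in Table 4 could be approximated for numeric features as follows; the choice of total variation distance over shared histogram bins, and of mean absolute correlation difference, is one possibility among several.

```python
import numpy as np
import pandas as pd

def histogram_similarity(real: pd.Series, synth: pd.Series, bins: int = 20) -> float:
    """1 - total variation distance between binned distributions (1.0 = identical)."""
    lo = min(real.min(), synth.min())
    hi = max(real.max(), synth.max())
    edges = np.linspace(lo, hi, bins + 1)
    p, _ = np.histogram(real, bins=edges)
    q, _ = np.histogram(synth, bins=edges)
    p = p / p.sum()
    q = q / q.sum()
    return 1.0 - 0.5 * np.abs(p - q).sum()

def correlation_difference(real: pd.DataFrame, synth: pd.DataFrame) -> float:
    """Mean absolute difference between the correlation matrices (0.0 = identical)."""
    diff = (real.corr(numeric_only=True) - synth.corr(numeric_only=True)).abs()
    return float(diff.values.mean())
```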
Data utility
Data utility refers to how well synthetic data can replace or add to source data for the
specific data objective of the organisation.
There are different approaches to evaluate the utility of synthetic data. The true test of
utility is how it performs in real-world tasks. One common approach to check this is by
training identical AI/ML models on the synthetic data and on the training data. The performance of the two models is then compared on the control dataset, simulating testing in the production environment, to assess the utility of the synthetic data. Examples of
performance metrics generally used include “accuracy”, “precision”, “recall”, “F1-Score”,
or “Area Under the ROC Curve (AUC-ROC)” for classification tasks, and “Mean Absolute
Error (MAE)” or “Mean Squared Error (MSE)” for regression tasks 19 (see definition in
Table 5 below). If their compared scores are close, then it indicates that the synthetic
data has high utility. In simple terms, a high utility score means that machines trained
on synthetic data work similarly to those trained on training data.
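A sketch of this comparison, sometimes described as "train on synthetic, test on real", assuming tabular data with numeric features and a categorical target column named label (feature encoding and model tuning are omitted for brevity):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def utility_scores(train_df, synth_df, control_df, target="label"):
    """Compare models trained on real vs. synthetic data, both tested on the control set."""
    X_ctrl, y_ctrl = control_df.drop(columns=[target]), control_df[target]
    scores = {}
    for name, data in {"real": train_df, "synthetic": synth_df}.items():
        model = RandomForestClassifier(random_state=0)
        model.fit(data.drop(columns=[target]), data[target])
        scores[name] = f1_score(y_ctrl, model.predict(X_ctrl), average="macro")
    # Close scores indicate the synthetic data has high utility for this task.
    return scores
```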
When trying to maximise the utility of data, there is often an inherent trade-off between data utility and data protection. Thus, a fine balance between data utility and data protection needs to be achieved through an iterative process (Steps 3 and 4) to synthesise data to an acceptable level of re-identification risk while preserving the right level of data utility.
19
There is another performance metric suitable for regression tasks, i.e., replicability, which is used for assessing data utility and is described here in addition to those listed in Table 5. See Khaled El Emam et al., “An Evaluation of the Replicability of Analyses Using Synthetic Health Data,” Scientific Reports 14 (2024), https://www.nature.com/articles/s41598-024-57207-7
20
A type of average that gives more weight to lower values of precision and recall scores.
Mean Squared Error (MSE): Measures the model's errors in prediction by averaging the squares of the errors between predicted and actual values. MSE penalises larger errors more heavily than smaller ones, due to squaring the error values. This makes it more sensitive to outliers and large errors. It is calculated as the mean of the squared differences between actual and predicted values.
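Expressed as a formula (standard definition), where $y_i$ denotes the actual value, $\hat{y}_i$ the predicted value and $n$ the number of observations:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$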
v. Select relevant performance metrics that meet the data objectives to measure data
utility.
Step 4: Assess re-identification risks

Generally, re-identification (or privacy) risk assessment for synthetic data is an attack-based evaluation. It evaluates how successfully an adversary, carrying out re-identification attacks on the synthetic dataset through singling out, linkability and inference attacks (as described in Annex D), can determine whether an individual belongs to the source dataset (i.e., membership inference) and/or derive details of an individual from the source dataset which are otherwise undisclosed (i.e., attribute inference). The goal for organisations is to ensure that the re-identification risk levels for the three key re-identification attacks are acceptable. If the re-identification risk level is unacceptable, repeat Step 3 to re-generate synthetic data to meet the acceptable risk level. This can be achieved by applying more data protection controls on the source data, e.g., generalising the data or adding noise (see “Checklist for data preparation” in Table 3).
While there is no universally accepted numerical threshold value for risk level, some
organisations 21 have chosen to align their re-identification risk level with existing
industry guidelines and recommendations for de-identified/anonymised data (see
Table 7). However, organisations should take note that the computation method for
re-identification threshold in a de-identified/anonymised dataset is very different from
that for a synthetic dataset. Nevertheless, the fundamental basis for both is that the
re-identification/privacy risk assessment is a probabilistic measurement.
21
Samer El Kababji et al., “Evaluating the Utility and Privacy of Synthetic Breast Cancer Clinical Trial Data
Sets,” JCO Clinical Cancer Informatics 7 (2023), https://ascopubs.org/doi/full/10.1200/CCI.23.00116
22
European Medicines Agency, “European Medicines Agency Policy on Publication of Clinical Data for
Medicinal Products for Human Use,” 2019, https://www.ema.europa.eu/en/documents/other/policy-70-
european-medicines-agency-policy-publication-clinical-data-medicinal-products-human-use_en.pdf
23
Health Canada, “Guidance Document on Public Release of Clinical Information: Profile Page,” 2019,
https://www.canada.ca/en/health-canada/services/drug-health-product-review-approval/profile-
public-release-clinical-information-guidance.html
ISO/IEC 27559 (Privacy enhancing data de-identification framework): Summarises a list of example thresholds, providing a range of acceptable values which encompasses 0.09.
In this final step, organisations should identify all potential residual risks and
implement appropriate mitigation controls (technical, governance, and contractual) to
minimise the identified risks. These risks and controls should be documented and
approved by the management and key stakeholders as part of the organisation’s
enterprise risk framework.
Organisations can take into consideration the following risks as part of risk assessment.
New insights may be learnt about the source data by analysing the synthetic dataset
alone or in combination with other available datasets. Organisations should assess if
these insights may be sensitive or could misinform.
In determining the source dataset for synthetic data, it is important to also consider
the sampling fraction from the population dataset, which is the ratio of the sample size
within the source data as compared to the population size. For example, an adversary
will have a lower chance of predicting whether a target group of individuals from a
population is included in a synthetic dataset that is trained from a source dataset
sampled from 20% of the population, as compared with a source dataset sampled from
90% of the population.
Parties receiving synthetic data
The receiving parties of the synthetic data, including any data intermediaries, may pose
data breach compliance risks when handling synthetic data. Organisations should
assess the data recipient’s ability and motivation to re-identify individuals from the
dataset. A data recipient who possesses specialised skillsets or technologies may be able to combine specialised knowledge with publicly available information to re-identify individuals from the dataset. Such risks must be accounted for in the risk assessment
exercise.
Changing environment
The likelihood of re-identification for any given synthetic dataset increases over time, due to increases in computing power and improvements in data-linking techniques.
Model leakage
A model that has been trained using source data to generate synthetic data can be susceptible to a malicious attack by an adversary seeking to reconstruct (parts of) the source data.
The following Table 8 lists examples of best practices that organisations can consider
implementing to manage residual risks posed by using synthetic data.
Table 8. Best practices and security controls to implement and manage risks

Governance
• Access controls: Implement access control for the source data and the synthetic data generator model. Apply access control to synthetic data where the re-identification or residual risk is high, especially if the data contains highly sensitive information or insights.
• Asset management: Properly label synthetic data to prevent human error when managing both source data and synthetic data.
• Risk management: Periodically conduct re-identification risk reviews of synthetic datasets, especially if these are publicly released.

Legal controls
• Have in place contractual agreements to outline the responsibilities of third-party recipients of the synthetic data and/or models, as well as any third-party solution providers who provide the synthetic data generation tools. This includes safeguarding the data/model and prohibiting attempts to re-identify individuals.
Incident management
Organisations should identify the risks of data breaches involving synthetic data,
synthetic data generator model, and model parameters, and incorporate relevant
scenarios into their incident management plans. The following considerations may be
relevant for organisations’ internal investigations24:
Loss of fully synthetic data (for synthetic data that is not intended for public
release)
Fully synthetic data that has data protection best practices incorporated in its
generation process and has been assessed to have a low re-identification risk is
generally not considered personal data. However, organisations should still investigate the incident to understand the root cause and improve their internal safeguards against such occurrences in the future. Organisations should also monitor if there is any evidence of actual re-identification and assess if it would be a notifiable data breach to PDPC.
24
For data breach reporting to PDPC, organisations will have to assess if it is a notifiable breach based on the PDPA’s Data Breach Notification obligation.
Loss of synthetic data generator model and/or model parameters
Both the synthetic data generator model and its parameters can provide useful information to an adversary seeking to perform a model inversion attack. Access to the generated synthetic data may further enhance the adversary's ability to recover the source data. Organisations should investigate the incident to understand the root cause so as to improve their internal safeguards. They should also monitor for a possible successful model inversion attack, which may result in the reconstruction and disclosure of the source data. Where such reconstruction and disclosure of source data is detected, organisations will have to assess if such a breach would be notifiable.
Annex B: Data Dictionary Format
The following is a sample of data dictionary format:
CODINGS
Example 1: If TYPE is ‘date’, use Excel convention to indicate the date format, e.g., dd/mm/yyyy, mm-dd-yyyy, etc.
Example 3: If TYPE is ‘numeric’, specify the range, e.g., [0,100] OR (3,4).
Remarks: Take special notice of capital/small letters to avoid confusion.

FREQUENCY
For longitudinal data. Use to indicate if the variable is collected during a particular visit type.
Example 1: BASELINE; 6 WEEK; 6 MONTH
Example 2: VISIT 1; VISIT 2
Remarks: Leave blank if not longitudinal data.

CATEGORY
Use to group the variable under a specific category.
Example 1: DEMOGRAPHICS
Example 2: ECHO
Example 3: LIFESTYLE

SECONDARY
Use to indicate if the variable is derived (computed) from other variables.
Remarks: If yes, explain how the variable was computed from other variables, such as the BMI formula or a diagnosis standard/criteria, either in the CONSTRAINTS or REMARKS column.

CONSTRAINTS
How the variable is dependent on other variables.
Example 1: ‘Head_circ’ (head circumference) is a variable collected for ‘age’ <= 6. Leave empty if ‘age’ > 6.
Example 2: Collected only for data cohort ‘<COHORT NAME>’ or hospital ‘<HOSPITAL A>’.
Example 3: ‘Ever_pregnant’ only collected for females above age of 12. If ‘male’ or ‘female’ below age of 12, recorded as ‘N.A.’ If ‘female’ above age of 12, either ‘YES’, ‘NO’, or ‘UNKNOWN’.
Example 4: ‘BMI’ only computable if ‘height’ and ‘weight’ are also collected. Leave blank if either value is blank.
Remarks: This information will help data users decide if a value is missing/unknown (should be collected but not collected), or not applicable (not collected because of procedure). Note that the value of a variable might be dependent (or conditional) on other variables, but it is not necessarily derived from other variables; CONSTRAINTS and SECONDARY are complementary, but the former does not imply the latter.

REMARKS
Additional comments, such as how the data is encoded, and/or concerns related to the variable.
Example 1: How categorical variables are encoded as integers: 1=NO, 0=YES, -1=N.A.
Example 2: Sensitive OR self-reported variable, etc.
Example 3: Metric unit used for collection, ‘cm’, ‘m’, ‘inches’, etc.
Remarks: It is often necessary to leave a note to remind data owners/users of the difficulties encountered during data collection, the corresponding response, and associated concerns. Some of these remarks can be included in the variable description, or here, if they are deemed miscellaneous.
Annex C: Examples of Methods of
Synthetic Data Generation
Statistical Methods
Contributed by Betterdata.ai
Bayesian networks (BN) are probabilistic models that use a directed acyclic
graph (DAG) to depict conditional dependencies between variables, enabling
the generation of synthetic data statistically similar to the original data. BNs are
helpful in sectors like healthcare and finance where accurate data relationships
are essential. Typically, BNs require significant domain expertise for precise
modelling via an expert-driven approach25. Alternatively, they can also be structured through data-driven methods, although these may compromise accuracy due to less reliable inferences about the underlying data relationships.
25
Anthony Costa Constantinou, Norman Fenton, and Martin Neil, “Integrating Expert Knowledge with
Data in Bayesian Networks: Preserving Data-Driven Expectations When the Expert Variables Remain
Unobserved,” Expert Systems with Applications 56 (2016): 197–208,
https://www.sciencedirect.com/journal/expert-systems-with-applications/vol/56/suppl/C
26
Ergute Bao et al., “Synthetic Data Generation with Differential Privacy via Bayesian Networks,” Journal
of Privacy and Confidentiality 11, no. 3 (2021), https://dr.ntu.edu.sg/handle/10356/164213
27
Ole J. Mengshoel, “Understanding the Scalability of Bayesian Network Inference Using Clique Tree
Growth Curves,” Artificial Intelligence 174, no. 12–13 (2010): 987–1006,
https://ntrs.nasa.gov/api/citations/20090033938/downloads/20090033938.pdf
computational demands, they may reduce accuracy. Therefore, BNs are
favoured for scenarios that require interpretability but less for high-dimensional
datasets where deep learning offers a more practical solution due to its ability
to efficiently handle large-scale data.
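As a toy illustration of the BN idea (a directed graph plus conditional probability tables, sampled parent-first), the sketch below uses entirely made-up structure and probabilities; real BN synthesisers learn or elicit these from the source data and domain experts.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_record():
    """Ancestral sampling from a toy network where 'condition' depends on 'age_band'."""
    age_band = rng.choice(["<40", "40-65", ">65"], p=[0.4, 0.4, 0.2])
    gender = rng.choice(["M", "F"], p=[0.5, 0.5])
    # Conditional probability of a health condition given age band (illustrative values).
    p_condition = {"<40": 0.05, "40-65": 0.20, ">65": 0.45}[age_band]
    condition = bool(rng.random() < p_condition)
    return {"age_band": age_band, "gender": gender, "condition": condition}

synthetic = [sample_record() for _ in range(1000)]
```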
(B) Conditional-Copulas
Conditional-Copulas are best suited for synthetic data generation when the
training datasets are moderately sized, often generating time-efficient and
robust replication of the required joint distributions of the data. Compared with relatively costly machine-learning methods, which, being data-driven, rely heavily on the cardinality and size of the available training data, copulas provide a cost-effective alternative that balances data availability with prior expert knowledge, generating diverse sample sets based on pre-determined conditions for methodology testing and algorithm training.
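An outline of one common construction, the Gaussian copula, for numeric columns; this is a simplified sketch rather than a production implementation, and it ignores categorical attributes and tail behaviour.

```python
import numpy as np
import pandas as pd
from scipy import stats

def gaussian_copula_synthesise(df: pd.DataFrame, n_samples: int, seed: int = 0) -> pd.DataFrame:
    rng = np.random.default_rng(seed)
    data = df.to_numpy(dtype=float)

    # 1. Map each column to standard normal scores via its empirical ranks.
    ranks = stats.rankdata(data, axis=0) / (len(data) + 1)
    normal_scores = stats.norm.ppf(ranks)

    # 2. Estimate the dependence structure as the correlation of the normal scores.
    corr = np.corrcoef(normal_scores, rowvar=False)

    # 3. Sample correlated standard normals.
    samples = rng.multivariate_normal(np.zeros(data.shape[1]), corr, size=n_samples)

    # 4. Map back to each column's original marginal via empirical quantiles.
    u = stats.norm.cdf(samples)
    synth = {col: np.quantile(data[:, j], u[:, j]) for j, col in enumerate(df.columns)}
    return pd.DataFrame(synth)
```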
(C) Marginal-Based Data Synthesis
[Figure: illustration of a marginal of table T on the attribute set {Gender, Occupation}]
28
“Bayesian Network,” Wikipedia, 2024, https://en.wikipedia.org/wiki/Bayesian_network
• Privacy: the data synthesis process could offer strong privacy protection,
if noise is carefully introduced during the selection and construction of
marginals and the training of the statistical model.
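To make the idea concrete, the sketch below computes a single two-way marginal (e.g., on {Gender, Occupation}), perturbs its counts with Laplace noise, and samples synthetic rows from the resulting distribution. Real marginal-based synthesisers select and combine many marginals and calibrate the noise to a formal privacy budget; this is only a single-marginal illustration.

```python
import numpy as np
import pandas as pd

def synthesise_from_noisy_marginal(df, cols, n, epsilon=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Count the joint marginal on the selected attributes.
    counts = df.groupby(cols).size()
    # Add Laplace noise scaled to 1/epsilon (each record affects one cell by at most 1).
    noisy = (counts + rng.laplace(scale=1.0 / epsilon, size=len(counts))).clip(lower=0)
    probs = noisy / noisy.sum()
    # Sample synthetic rows according to the noisy marginal distribution.
    idx = rng.choice(len(probs), size=n, p=probs.to_numpy())
    return pd.DataFrame([counts.index[i] for i in idx], columns=cols)
```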
One way to generate synthetic data is to apply decision trees sequentially, built on commonly used classification and regression tree (“CART”) algorithms, although variants of these (e.g., boosted trees) can also be used. The principle
29
Jun Zhang et al., “PrivBayes: Private Data Release via Bayesian Networks,” in Proceedings of the 2014
ACM SIGMOD International Conference on Management of Data, 2014, 1423–34,
https://dl.acm.org/doi/10.1145/2588555.2588573
30
Ryan McKenna, Gerome Miklau, and Daniel Sheldon, “Winning the NIST Contest: A Scalable and
General Approach to Differentially Private Synthetic Data,” Journal of Privacy and Confidentiality 11, no.
3 (2021), https://journalprivacyconfidentiality.org/index.php/jpc/article/view/778
31
Kuntai Cai et al., “Data Synthesis via Differentially Private Markov Random Fields,” Github, n.d.,
https://github.com/caicre/PrivMRF
32
National Institute of Standards and Technology, “Disassociability Tools,” NIST, 2023, https://
www.nist.gov/itl/applied-cybersecurity/privacy-engineering/collaboration-space/focus-areas/de-id/
tools#dpchallenge
33
National Institute of Standards and Technology, “2020 Differential Privacy Temporal Map Challenge,”
NIST, 2022, https://www.nist.gov/ctl/pscr/open-innovation-prize-challenges/past-prize-challenges/
2020-differential-privacy-temporal
34
SAP Community, “SAP Data Intelligence: Data Synthesizer for Machine Learning Operator,”
Technology Blogs by SAP, 2021, https://community.sap.com/t5/technology-blogs-by-sap/sap-data-
intelligence-data-synthesizer-for-machine-learning-operator/ba-p/13501498
35
“Reprosyn: Synthesising Tabular Data,” Github, 2022, https://github.com/alan-turing-institute/
reprosyn; “Synthcity,” Github, 2024, https://github.com/vanderschaarlab/synthcity; “DataSynthesizer,”
Github, 2023, https://github.com/DataResponsibly/DataSynthesizer; DataCebo, “SDGym,” Github, 2024,
https://github.com/sdv-dev/SDGym; “DPART | Differentially Private Auto-Regressive Tabular,” Github,
2024, https://github.com/hazy/dpart
30
is to sequentially synthesise variables using classification and regression
models.36
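A simplified sketch of that principle for categorical columns, where each column is synthesised by a decision tree trained on the columns already synthesised (the first column is bootstrapped from its empirical distribution); production implementations handle mixed data types, numeric leaves and smoothing more carefully.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

def sequential_cart_synthesise(df: pd.DataFrame, seed: int = 0) -> pd.DataFrame:
    """Sequentially synthesise categorical columns with classification trees (CART)."""
    rng = np.random.default_rng(seed)
    cols = list(df.columns)
    n = len(df)

    # Bootstrap the first column from its empirical distribution.
    synth = pd.DataFrame({cols[0]: rng.choice(df[cols[0]].to_numpy(), size=n)})

    for i, col in enumerate(cols[1:], start=1):
        # Fit a tree predicting the next column from the columns already synthesised.
        X_real = pd.get_dummies(df[cols[:i]], dtype=float)
        model = DecisionTreeClassifier(random_state=seed).fit(X_real, df[col])

        # Encode the synthetic predictors with the same dummy columns as the real data.
        X_synth = pd.get_dummies(synth[cols[:i]], dtype=float).reindex(
            columns=X_real.columns, fill_value=0.0
        )

        # Draw each synthetic value from the tree's predicted class probabilities.
        proba = model.predict_proba(X_synth)
        synth[col] = [rng.choice(model.classes_, p=p) for p in proba]
    return synth
```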
Generative Adversarial Networks (GANs) are deep generative models that excel
in synthesising complex, high dimensional datasets. Through an adversarial
process, the generator creates synthetic data which a discriminator evaluates
for realism, prompting a continual improvement in the synthetic output. This
iterative refinement enables GANs to produce synthetic data that closely
resembles the original, outperforming non-deep learning techniques in
complex real-world datasets.
GANs also demonstrate the ability to handle different data structures commonly
found in enterprise settings. The development of specialised models like
CTGAN and CTABGAN+ for static tabular data, TimeGAN for time series data
and IRG for relational data highlights the adaptability of GANs in diverse data
settings.
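For illustration, a bare-bones GAN training loop on scaled numeric tabular data might look as follows (PyTorch; all architecture and hyperparameter choices here are arbitrary). Specialised tabular GANs such as CTGAN add conditional sampling and mode-specific normalisation on top of this basic adversarial setup.

```python
import torch
import torch.nn as nn

def train_gan(real: torch.Tensor, noise_dim: int = 16, epochs: int = 200, batch: int = 128):
    """Minimal GAN on a (n_rows, n_features) tensor of scaled numeric data."""
    n_feat = real.shape[1]
    G = nn.Sequential(nn.Linear(noise_dim, 64), nn.ReLU(), nn.Linear(64, n_feat))
    D = nn.Sequential(nn.Linear(n_feat, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
    opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
    opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
    loss = nn.BCEWithLogitsLoss()

    for _ in range(epochs):
        idx = torch.randint(0, real.shape[0], (batch,))
        real_batch = real[idx]
        fake_batch = G(torch.randn(batch, noise_dim))

        # Discriminator step: learn to distinguish real rows from synthetic rows.
        d_loss = loss(D(real_batch), torch.ones(batch, 1)) + \
                 loss(D(fake_batch.detach()), torch.zeros(batch, 1))
        opt_d.zero_grad(); d_loss.backward(); opt_d.step()

        # Generator step: try to fool the discriminator.
        g_loss = loss(D(fake_batch), torch.ones(batch, 1))
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()

    return G  # sample synthetic rows with G(torch.randn(n, noise_dim))
```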
36
For more information, refer to Khaled El Emam, Lucy Mosquera, and Richard Hoptroff, “Evaluating
Synthetic Data Utility,” in Practical Synthetic Data Generation Balancing: Privacy and the Broad
Availability of Data (O’Reilly Media, Inc, 2020).
37
Khaled El Emam, Lucy Mosquera, and Chaoyi Zheng, “Optimizing the Synthesis of Clinical Trial Data
Using Sequential Trees,” Journal of the American Medical Informatics Association 28, no. 1 (2020): 3–13.
complex relationships within data, making them ideal for creating synthetic
datasets that mirror the complexity of the real world.
LLMs also excel when original data is limited, leveraging extensive pre-trained knowledge to fill gaps in sparse original data and generate rich data in data-scarce environments. However, while LLMs offer remarkable capability in
tabular data synthesis, they require substantial computational power and time
to train, presenting a trade-off.
Annex D: Re-identification Risks
As synthetic data generally tries to retain the statistical properties and characteristics
of its source data, adversaries can attempt to re-identify or extract sensitive
information about an individual from the synthetic data. The following describes the
different types of re-identification attacks (commonly referred to as privacy attacks) on
synthetic datasets.
A singling out attack is generally conducted on outliers, e.g., unique attribute(s),
rare data attribute(s) or unique combination of attributes. As the generated
synthetic datapoints attempt to reflect or capture the presence and
characteristics of such outliers, they offer a heightened possibility of singling
out unique data records, and outliers are especially susceptible. While singling
out may not represent a re-identification risk by itself, it may allow the adversary
to gain information about the data record through using related datasets or
other background information (see example in linkability attack).
For a linkability attack to occur, the adversary is assumed to have access to two
sets of data i.e., (i) synthetic data and (ii) other publicly available data or private
datasets where the adversary has privileged access. In a linkability attack, the
adversary attempts to determine if any data points from the two data sets
belong to the same individual, or group of individuals.
different datasets. Intuitively, the adversary’s chances of a successful attack are
likely to improve when data utility of the generated synthetic data increases, i.e.,
the closer it resembles the statistical characteristics of the source data, the
higher the chance of a successful attack.
For instance, a successful attack occurs when an adversary can infer with high
confidence that an 86 years-old male with diabetes (from the source dataset of
the community hospital) has other medical complications such as hypertension.
Importantly, this observation can apply to any person belonging to the same
distribution (e.g., males above 80 years of age with diabetes), even when his
data has never been used for training.
Annex E: Examples of Approaches to
Evaluate Re-identification Risks
This annex introduces different approaches to evaluate re-identification/privacy risks
adopted by three industry members. These approaches can be applied to synthetic
data regardless of the generation method used.
(A) Approach 1
For computation of the attribution disclosure, the following article describes the
process in depth: https://www.jmir.org/2020/11/e23139/.
For the membership disclosure, the following article describes the details of the
calculation: https://academic.oup.com/jamiaopen/article/5/4/ooac083/6758492?searchresult=1.
38
Khaled El Emam, Lucy Mosquera, and J. Bass, “Evaluating Identity Disclosure Risk in Fully Synthetic
Health Data: Model Development and Validation,” Journal of Medical Internet Research 22, no. 11 (2020):
e23139.
39
Khaled El Emam, Lucy Mosquera, and Xi Fang, “Validating A Membership Disclosure Metric For
Synthetic Health Data,” Journal of the American Medical Informatics Association 5, no. 4 (2022): ooac083.
References
Emam, Khaled El, Lucy Mosquera, and J. Bass. “Evaluating Identity Disclosure Risk in
Fully Synthetic Health Data: Model Development and Validation.” Journal of Medical
Internet Research 22, no. 11 (2020): e23139.
Emam, Khaled El, Lucy Mosquera, and Xi Fang. “Validating A Membership Disclosure
Metric For Synthetic Health Data.” Journal of the American Medical Informatics
Association 5, no. 4 (2022): ooac083.
Emam, Khaled El, Lucy Mosquera, Xi Fang, and Alaa El-Hussuna. “Utility Metrics for
Evaluating Synthetic Health Data Generation Methods: Validation Study.” JMIR Medical
Informatics 10, no. 4 (2022).
Kababji, Samer El, Nicholas Mitsakakis, Xi Fang, Ana-Alicia Beltran-Bless, Greg Pond,
Lisa Vandermeer, Dhenuka Radhakrishnan, and Khaled El Emam. “Evaluating the Utility
and Privacy of Synthetic Breast Cancer Clinical Trial Data Sets.” JCO Clinical Cancer
Informatics 7 (2023). https://ascopubs.org/doi/full/10.1200/CCI.23.00116
Yang, S. “Process Mining the Trauma Resuscitation Patient Cohorts.” In 2018 IEEE
International Conference on Healthcare Informatics (ICHI), 29–35, 2018.
(B) Approach 2
Contributed by A*STAR
To that end, two separate attacks were performed, namely the (i) control attack
and the (ii) main attack. The control attack targets the control dataset and
measures patterns common to the whole population; the main attack targets
the training dataset and measures patterns common to the whole population
and possible biases towards the training dataset. The computed asymmetry
between the two attacks provides a fair measurement of how effective the
synthetic data is in differentiating individuals in the training dataset from the
larger population, while grounding the obtained privacy-risk metric with some
reasonable baseline from which to make further interpretations.
Lastly, the framework also measures a “naïve” baseline that assumes no prior knowledge of the synthetic dataset and is, therefore, entirely dependent on luck. This closes a loophole where one might erroneously assume that the generated synthetic dataset is risk-free because it has extremely poor fidelity/utility, and/or when the designed inference/linkability attacks or the synthetic data are not sensible in the first place. In these scenarios, the “naïve” attack might outperform the other two attacks, indicating that the test is flawed.
The computed asymmetry between the main and control attack is normalised
to obtain a privacy risk leakage metric, known as “R”. This value is bounded
between 0 and 1 and increases with the risks of privacy leakage. It is reasonable
to first decide on an acceptable threshold value of “R” before generating the
synthetic data; reversal of this process exposes one to considerable latitude in
justifying one’s product. The said threshold can be fixed based on policy, and
further mitigated based on the sensitivity of the training dataset and the
availability of the generated synthetic dataset.
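A sketch of how such a normalised leakage value might be computed from the success rates of the main and control attacks; the exact normalisation used by the framework may differ, and the formula below simply rescales the main attack's excess success over the control baseline into [0, 1].

```python
def privacy_leakage(main_success_rate: float, control_success_rate: float) -> float:
    """Normalised leakage R in [0, 1]: 0 means no advantage over the control attack."""
    if control_success_rate >= 1.0:
        return 0.0  # control attack already perfect; no measurable advantage
    r = (main_success_rate - control_success_rate) / (1.0 - control_success_rate)
    return max(0.0, min(1.0, r))

# Example: main attack succeeds 40% of the time, control attack 25% -> R = 0.2
print(privacy_leakage(0.40, 0.25))
```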
It is crucial to note that the privacy risks are evaluated with respect to the
individuals in the training database, and not the wider public. As such, privacy
is compromised when an adversary finds it easier to (i) determine if an individual
belongs to the training database and (ii) derive details of an individual from the
training database otherwise undisclosed.
References
For more details, a description of the framework and the attack algorithms can be
found in the paper by M. Giomi et al. “A Unified Framework for Quantifying Privacy
Risk in Synthetic Data.” In Proceedings on Privacy Enhancing Technologies Symposium
(PETS 2023), 2023.
(C) Approach 3
Contributed by Betterdata.ai
40
Cynthia Dwork et al., “Calibrating Noise to Sensitivity in Private Data Analysis,” in Theory of
Cryptography. TCC 2006. Lecture Notes in Computer Science, Vol 3876, ed. S. Halevi and T. Rabin (Berlin:
Springer, 2006); Cynthia Dwork and Aaron Roth, “The Algorithmic Foundations of Differential Privacy,”
Foundations and Trends® in Theoretical Computer Science 9, no. 3–4 (2014): 211–407.
41
Thomas Steinke, Milad Nasr, and Matthew Jagielski, “Privacy Auditing with One (1) Training Run,” in
NIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, ed.
A. Oh, T. Naumann, and A. Globerson (Curran Associates Inc., 2023), 49268–80, https://dl.acm.org/
doi/10.5555/3666122.3668265
42
TAPAS, “Welcome to TAPAS’s Documentation!,” tapas, 2022, https://tapas-privacy.readthedocs.io/
en/latest/index.html
Organization Name: Apple [5,6]
Data Type and DP Budget (ε): Health Data (2.0), Safari (4.0), Emoji (4.0), QuickType (8.0)
Collection Period: 2017-2024
Purpose of Data Collection: Analytics

Organization Name: 2020 US Census Data [7,8]
Data Type and DP Budget (ε): Housing Unit Data (2.47), Person’s File (17.14)
Collection Period: 2020
Purpose of Data Collection: Deciding Fund Distribution, Assisting States
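For context on what an ε budget means mechanically, the Laplace mechanism cited above releases a statistic with noise whose scale is sensitivity/ε, so a smaller ε means more noise and stronger protection; a minimal sketch:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, seed=0):
    """Release a statistic with Laplace noise calibrated to sensitivity / epsilon."""
    rng = np.random.default_rng(seed)
    return true_value + rng.laplace(scale=sensitivity / epsilon)

# Example: a count query (sensitivity 1) released with budgets epsilon = 8.0 vs 0.5.
print(laplace_mechanism(1000, sensitivity=1.0, epsilon=8.0))  # little noise
print(laplace_mechanism(1000, sensitivity=1.0, epsilon=0.5))  # much more noise
```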
References
Dwork, Cynthia, and Aaron Roth. “The Algorithmic Foundations of Differential Privacy.”
Foundations and Trends® in Theoretical Computer Science 9, no. 3–4 (2014): 211–407.
Steinke, Thomas, Milad Nasr, and Matthew Jagielski. “Privacy Auditing with One (1)
Training Run.” In NIPS ’23: Proceedings of the 37th International Conference on Neural
Information Processing Systems, edited by A. Oh, T. Naumann, and A. Globerson,
49268–80. Curran Associates Inc., 2023. https://dl.acm.org/doi/10.5555/
3666122.3668265
Tang, Jun, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang.
“Privacy Loss in Apple’s Implementation of Differential Privacy on MacOS 10.12.”
ArXiv:1709.02753, 2017. https://arxiv.org/abs/1709.02753
ACKNOWLEDGEMENTS
The PDPC and the Infocomm Media Development Authority (IMDA) sincerely extend their appreciation to the following for their editorial contributions in the development of this publication:
• Betterdata.ai
PDPC and IMDA also express their appreciation and acknowledgment for all the
valuable feedback received from the following organisations:
• Mastercard
Agencia Espanola Proteccion Datos. “Synthetic Data and Data Protection.” Blog, 2023.
https://www.aepd.es/en/prensa-y-comunicacion/blog/synthetic-data-and-data-
protection
Information Commissioner’s Office (U.K.). “Chapter 5: Privacy-Enhancing
Technologies (PETs).” ICO call for views: Anonymisation, pseudonymisation and
privacy enhancing technologies guidance, 2022. https://ico.org.uk/about-the-ico/ico-
and-stakeholder-consultations/ico-call-for-views-anonymisation-pseudonymisation-
and-privacy-enhancing-technologies-guidance/
———. “G7 DPAs’ Emerging Technologies Working Group Use Case Study on Privacy
Enhancing Technologies.” UK GDPR guidance and resources, n.d.
https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/data-
sharing/privacy-enhancing-technologies/case-studies/g7-dpas-emerging-
technologies-working-group-use-case-study-on-privacy-enhancing-technologies/
END OF DOCUMENT
JOINTLY DEVELOPED BY
Copyright 2024 – Personal Data Protection Commission Singapore (PDPC) and Agency for Science, Technology and
Research Singapore
The contents herein are not intended to be an authoritative statement of the law or a substitute for legal or other
professional advice. The PDPC and its members, officers, employees and delegates shall not be responsible for any
inaccuracy, error or omission in this publication or liable for any damage or loss of any kind as a result of any use of or
reliance on this publication.
The contents of this publication are protected by copyright, trademark, or other forms
of proprietary rights and may not be reproduced, republished, or transmitted in any
form or by any means, in whole or in part, without written permission.