
A Survey on Differential Privacy for Unstructured

Data Content
YING ZHAO and JINJUN CHEN, Swinburne University of Technology, Australia

Huge amounts of unstructured data including image, video, audio, and text are ubiquitously generated
and shared, and it is a challenge to protect sensitive personal information in them, such as human faces,
voiceprints, and authorships. Differential privacy is the standard privacy protection technology that provides
rigorous privacy guarantees for various data. This survey summarizes and analyzes differential privacy so-
lutions to protect unstructured data content before it is shared with untrusted parties. These differential
privacy methods obfuscate unstructured data after they are represented with vectors and then reconstruct
them with obfuscated vectors. We summarize specific privacy models and mechanisms together with possible
challenges in them. We also discuss their privacy guarantees against AI attacks and utility losses. Finally, we
discuss several possible directions for future research.

CCS Concepts: • Security and privacy → Software and application security; • Information systems →
Information retrieval; • Theory of computation → Theory and algorithms for application domains;
• Mathematics of computing → Probability and statistics;

Additional Key Words and Phrases: Differential privacy, unstructured data content privacy, privacy protected
unstructured data, image, voiceprint, text, video

ACM Reference format:


Ying Zhao and Jinjun Chen. 2022. A Survey on Differential Privacy for Unstructured Data Content. ACM
Comput. Surv. 54, 10s, Article 207 (September 2022), 28 pages.
https://doi.org/10.1145/3490237

1 INTRODUCTION
Massive amounts of unstructured data including image, video, audio, and text are generated every
day from a wide range of sources. Images, videos, and audio are ubiquitously generated via smart
devices, such as personal cameras, smartphones, and various intelligent personal assistants. Text
documents are ubiquitously generated in customer surveys, emails, hospitals, companies, governments, and many other settings. These unstructured data are also widely shared with untrusted
parties. However, there is considerable personal sensitive information involved in them, such as
human faces, individual identities, vehicle license plates, voiceprints, and personal health records,
as well as personal preferences for words. Of particular concern is that these are all unique
characteristics of individuals; once released, users' privacy would be compromised forever. It

This paper is partly supported by Australian Research Council (ARC) projects DP170100136, LP180100758, DP190101893.
Authors’ address: Y. Zhao and J. Chen, Swinburne University of Technology, Melbourne, Australia; emails: yingzhao@
swin.edu.au, [email protected].
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2022 Association for Computing Machinery.
0360-0300/2022/09-ART207 $15.00
https://doi.org/10.1145/3490237


Fig. 1. General data obfuscation flow for content of unstructured data including image, audio, video, and
text on the user side.

is therefore a challenge to protect sensitive content in unstructured data before it is shared with
untrusted parties.
Differential Privacy (DP) [23] has been a de facto standard for a variety of data types in the
privacy protection domain over the past decade, due to its rigorous privacy definition and low
computational overhead. It guarantees that any pair of secrets cannot be distinguished by observ-
ing outputs independently of arbitrary background information by adversaries. It functions by
adding well-calibrated noise to individual data or database queries. Its original model has been
extended to many variants [19] for specific data scenarios, such as Metric DP [11], local DP [20],
shuffled DP [5, 33], and hybrid DP [4]. It has shown great effectiveness in protecting various sen-
sitive information, including consumers’ survey reports [108], smartphone application usage [50],
locations [79, 95, 113], genome-wide association studies [97], and eye-tracking data [92].
Recently, with the advent of General Data Protection Regulation (GDPR) and increasing
privacy concerns regarding sensitive content in unstructured data, including image, video, audio,
and text, many works have proposed differential privacy-based approaches to provide rigorous pri-
vacy guarantees for them. This survey summarizes these privacy approaches in protecting unstruc-
tured data content before it is shared with untrusted parties. We identify that these methods model
and represent unstructured data with vectors before they are obfuscated. Specifically, as shown in
Figure 1, unstructured data are vectorized with reversible transformations such as Singular Value
Decomposition (SVD) and word embeddings. The vectorization results including real-valued or
bit-valued vectors are then obfuscated with conventional DP methods. Finally, these obfuscated
vectors are inversely vectorized and used to construct obfuscated unstructured data. This survey
also provides privacy and utility analysis on these methods as well as their pros and cons, followed
by several possible directions for future research.

1.1 Outline and Survey Overview


Over the past decade, several surveys have summarized differential privacy technology and its
applications, outlined in Table 1:
(1) The initial survey by Dwork [21] recalled DP definitions and mechanisms for data analysis
including statistical data inference, contingency table release, and half-space queries.
(2) In the report [22], Dwork pointed out several potential interdisciplinary areas related to DP,
such as robust statistics, a subfield of statistics.
(3) The review [18] summarized DP preliminaries and applications in the context of healthcare
data and highlighted its limitations and respective possible opportunities in practical
scenarios.


Table 1. Summary on Prior Survey Articles

Ref. | Focuses | Major Viewpoints
[21] 2008 | Statistical databases | DP guaranteed statistical datasets, DP guaranteed learning theory
[22] 2009 | Theoretical frontier | DP synthetic datasets and coresets, interdisciplinary areas
[18] 2013 | Health data release | DP extensions, interactive and non-interactive analysis, practical limitations and opportunities
[24] 2017 | Privacy attacks | Reconstruction attacks, tracing attacks
[122] 2017 | Data release & data mining | Anonymized and learning-based DP publishing, Laplace mechanism-based DP data analysis & private learning structure
[16] 2018 | LDP tutorial | LDP deployments, fundamentals, statistical tasks
[37] 2019 | Decision trees | DP decision tree algorithms, realistic implementations
[46] 2019 | Cyber-physical systems | Smart grid, healthcare systems, transportation systems, IoT
[71] 2019 | Histogram publication | Studies inner analysis and external comparison, utility-enhancing approaches
[19] 2020 | Definition modifications | 7 categories of DP notion variants and extensions
[28] 2020 | GANs | Insights and applications of 8 DP-GAN methods, privacy and utility measurements
[49] 2021 | Social networks | Privacy attacks, privacy models, DP & LDP in social network analysis
[101] 2021 | DP-Cryptography | Cryptographic primitives for LDP deployments, DP alternatives for cryptographic implementations
This survey | Unstructured data content | Adaptive DP definitions, privacy mechanisms, utility-privacy analysis

(4) The report [24] discussed how DP releases coarse-grained aggregate statistics while
provably thwarting reconstruction attacks and tracing attacks.
(5) Zhu et al. [122] reviewed DP theories and implementations in data publishing and mining.
(6) Cormode et al. [16] reviewed local DP technologies deployed in Google, Apple, and
Microsoft and identified several research issues in progress and in planning at that time.
(7) Fletcher and Islam [37] focused on a particular data mining algorithm, decision trees, and
analyzed the performance of existing works in privacy-utility tradeoff.
(8) The work [46] presented DP techniques for cyber-physical systems, including smart grid,
healthcare systems, transportation systems, and industrial IoT.
(9) Nelson and Reuben [71] studied DP-protecting histograms and synthetic data and analyzed
building blocks in related mechanisms.
(10) In a recent view, Desfontaines and Pejó [19] provided an overview of DP variants and
extensions, based on how the basic DP definition is modified. They also presented possible
composition results of these existing definitions and the underlying relationships among them.
(11) Fan [28] surveyed recent DP-protecting GAN approaches, analyzing their insights and application fields as well as privacy and utility evaluation metrics.
(12) Jiang et al. [49] recently surveyed DP in social network analysis based on graphs.
(13) The latest work [101] discussed cryptographic primitives, including secure computation and anonymous communication, that assist LDP deployments by amplifying privacy levels and reducing utility loss, as well as LDP alternatives for practical and lightweight implementations of cryptographic primitives.


While those surveys focus on DP theories and applications in structured data scenarios, this work
investigates adaptive DP definitions, mechanisms, and privacy-utility tradeoffs for unstructured
data content, including image, audio, video, and text. For simplicity, the remainder of this work
uses "unstructured data" to refer to unstructured data content, excluding metadata such as tags,
timestamps (or time durations), and places. Section 2 presents related DP models including
basic DP, local DP, and Metric DP. Section 3 summarizes vectorization methods, privacy models,
and mechanisms for image, audio, video, and text, together with some discussions about their
challenges. Section 4 analyzes DP performances in privacy and utility tradeoff from several per-
spectives including utility loss sources, utility metrics, and privacy against AI attacks. Section 5
identifies several open issues and research directions for DP and its potential in handling privacy
issues in broader data scenarios.

2 PRELIMINARIES
2.1 Differential Privacy Models
DP [23] has been a de facto privacy model motivated by the indistinguishability idea in cryptog-
raphy. It is originally formalized to provide a statistical assurance that released aggregate data
cannot reveal whether individuals are in or out of the dataset. In other words, for a released statis-
tical result, the probabilities of it being from two potential datasets are close [23]. The two potential
datasets are defined as neighbouring databases as follows.
Definition 2.1 (Neighboring Databases). Two unstructured data items US_i and US_j are neighboring
if their aggregated real-valued representations x_i and x_j have the same number of elements and
differ in at most m elements. Here the difference between x_i and x_j is measured with the Hamming
distance; in most cases, m is set to 1.
Definition 2.2 (Differential Privacy (i.e., ϵ-indistinguishability)). A randomized mechanism M satisfies
ϵ-DP iff for any neighbouring databases x_i and x_j (x ∈ D^d, where D = {0, 1}^n or D = R^n) and for
any released statistical result y:

Pr[M(x_i) = y] ≤ e^ϵ · Pr[M(x_j) = y].    (1)
Recently, the original DP model was developed into the local DP model (i.e., Definition 2.3) [20,
53], aiming at applying the formalized privacy model to individual data in the user’s setting rather
than the aggregated setting in the curator’s setting. It is noteworthy that statistical utility is de-
termined on users’ population when local DP applied data are aggregated. More precisely, larger
populations generally cause higher utility.
Definition 2.3 (Local Differential Privacy (i.e., LDP)). Taking two inputs x_i and x_j, an LDP mechanism M outputs a probabilistic y following:

Pr[M(x_i) = y] ≤ e^ϵ · Pr[M(x_j) = y],    (2)

where x and y are d-dimensional vectors over the real numbers or bit values. x_i and x_j need not be
related at all.
Another DP variant known as d_X-metric privacy [11] was put forward, originally tailored as
geo-indistinguishability (i.e., geo-ind) [3] to protect location privacy. Geo-ind argues that LDP protects
geo-locations impractically, since it randomizes a location x_i to any other unrelated location x_j.
For example, suppose Alice is studying in the library, an event denoted as 1 for simplification; after
LDP obfuscation, this event can be transformed to the opposite one, "Alice is not in the library"
(denoted as 0), with non-negligible probability. Alice would search for a cafeteria
at lunch time. Unfortunately, it is not feasible for location-based applications such as Yelp to find


out an appropriate cafeteria for Alice based on her current obfuscated location. In this case, it is
more acceptable to obfuscate Alice’s exact position to its nearby ones with higher probabilities
and further ones with lower probabilities. Geo-ind achieves this requirement in the geo-location
privacy domain and introduces the geographic distance (i.e., Euclidean distance) besides privacy
budget parameter ϵ to constrain the randomization process.
The notion of d X -metric privacy later generalizes geo-ind and broadens Euclidean distance
(L 2 norm) to Manhattan distance (L 1 norm), Minkowski distance (Lp norm), and Chebyshev dis-
tance (L∞ norm) [15, 102] in a variety of DP privacy-protecting scenarios. For example, Chebyshev
distance is used when obfuscating smart meter readings to protect which TV channels are being
watched [3], while Manhattan distance is adopted when randomizing individuals' registration dates
on Facebook against adversaries' identification of individuals [3]. These
distance functions d X can be formally defined as X ×X → R+ and they form the key component
of the d X Metric Privacy definition together with the privacy budget parameter ϵ.
Definition 2.4 (Metric Differential Privacy (i.e., Metric DP)). A probabilistic mechanism M satisfies
Metric Differential Privacy iff any two real-valued inputs x i and x j produce an identical output y
with probabilities differing at most e ϵdX :
 
Pr[M(x_i) = y] / Pr[M(x_j) = y] ≤ e^(ϵ·d_X(x_i, x_j)),    (3)

where the metric d_X satisfies the axioms of a metric: d_X(x_i, x_j) ≥ 0, d_X(x_i, x_j) = d_X(x_j, x_i),
d_X(x_i, x_i) = 0, and d_X(x_i, x_j) ≤ d_X(x_i, x_k) + d_X(x_k, x_j), with x_i, x_j, x_k ∈ X. We note that
when d X (x i , x j ) = 1, local DP is recovered. It is also implied that Metric DP is a relaxed variant of
standard DP in [52]. Overall, Metric DP extends the basic DP definition into a practical notion in
which events closer to the original event, under various distance metrics, are required to occur with
higher probability. It therefore addresses some limitations of basic DP, which defines the difference
between events over the Hamming distance. To make the different distance metrics clearer, we
provide some fundamental distance metrics [15] in detail as follows.
Their geometric significance is also shown in Figure 2.
Definition 2.5 (Hamming Distance). Hamming distance uses the indicator [H] = 1 if the condition H
is true and [H] = 0 otherwise:

‖x − y‖_H := Σ_{i=1}^n [x_i ≠ y_i].

Definition 2.6 (Manhattan Distance). Manhattan distance is equivalent to Hamming distance when
x, y ∈ {0, 1}^n:

‖x − y‖_{L1} := Σ_{i=1}^n |x_i − y_i|.

Definition 2.7 (Euclidean Distance). Euclidean distance coincides with Manhattan distance when n = 1:

‖x − y‖_{L2} := sqrt( Σ_{i=1}^n (x_i − y_i)^2 ).

2.2 Differential Privacy Mechanisms


Differential privacy mechanisms are probabilistic mapping functions that implement DP defini-
tions. These methods are applied on the sensitive data(set) x to obtain a substitute of it. The
replacement approach is based on elaborately designed probabilistic mapping functions, which
guarantee that the substitute of x reveals only coarse-grained information about x. We then present
several commonly used methods that achieve DP, local DP, and Metric DP.

Fig. 2. A geometry description of different distance metrics, where red lines with the ✕ label indicate distances between x and y. Manhattan and Euclidean metrics are linear distances between two points, while Hamming distance measures the number of bit positions with different bits between two same-length data strings.
Definition 2.8 (L1 Sensitivity). The L1 sensitivity of a query function q : D^d → R is the maximum
L1 norm of the difference between q(x_i) and q(x_j), formally defined as

Δq = S(q) := max_{x_i, x_j} ‖q(x_i) − q(x_j)‖_{L1}.
Laplace mechanisms [23] obfuscate data by adding appropriate noise sampled from Laplace
distribution to it. The noise scale is determined by pre-set privacy parameters ϵ and the sensitivity
Δ of query functions. A smaller ϵ indicates a larger noise scale and thus stronger privacy protection.
Definition 2.9 (Laplace Mechanism). For a query function q, Laplace mechanisms function as
follows:
M_Lap(x) := q(x) + Lap(Δ/ϵ).    (4)
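For illustration, a minimal NumPy sketch of this mechanism is given below (it is not taken from any surveyed implementation; the query result, sensitivity, and ϵ values are placeholders).

```python
import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon, rng=np.random.default_rng()):
    # Add Laplace noise with scale Delta/epsilon, as in Equation (4).
    return query_result + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: a counting query (sensitivity 1) answered with epsilon = 0.5.
noisy_count = laplace_mechanism(query_result=42, sensitivity=1.0, epsilon=0.5)
```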
The Exponential mechanism [64] is another popular mechanism within DP. Its building block is
a score function defined as f : D^d × Y → R. Exponential mechanisms randomize the categorical
input x into potential output candidates y_1, y_2, . . . , y_n ∈ Y, with probabilities proportional to the
exponential of their scores, scaled by ϵ and the sensitivity Δ_f of the score function. Formally, it is
 
Pr[M_Exp(x) = y] := exp(ϵ·f(x, y) / (2Δ_f)) / Σ_{i=1}^n exp(ϵ·f(x, y_i) / (2Δ_f)).    (5)
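A minimal sketch of Equation (5) over a finite candidate set follows; the score function passed in stands for an application-specific quality measure and is an assumption of this example.

```python
import numpy as np

def exponential_mechanism(x, candidates, score, sensitivity, epsilon, rng=np.random.default_rng()):
    # Sample one candidate with probability proportional to exp(eps * f(x, y) / (2 * Delta_f)).
    scores = np.array([score(x, y) for y in candidates])
    weights = np.exp(epsilon * (scores - scores.max()) / (2.0 * sensitivity))  # shift by max for stability
    return candidates[rng.choice(len(candidates), p=weights / weights.sum())]
```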
To achieve local DP, the classic Randomized Response [62, 108] method has been developed to
obfuscate bit vectors in the users' local setting. Google's RAPPOR [25] was an early wide deployment,
for example in the Chrome Web browser, protecting users' preferences from being revealed to
Google Inc. and any other third parties.
Definition 2.10 (RAPPOR-based Randomized Response). For a bit vector x, RAPPOR randomly
flips each bit x of it following the rule of

M_RAPPOR(x) := Pr(y = 1) := { 1 − p/2,  if x = 1;  p/2,  if x = 0 },    (6)

where p = 2 / (1 + exp(ϵ/2)).
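The per-bit flipping rule of Equation (6) can be sketched as follows; note this is a simplified one-round version that omits RAPPOR's full Bloom-filter and permanent/instantaneous response pipeline.

```python
import numpy as np

def rappor_randomized_response(bits, epsilon, rng=np.random.default_rng()):
    # Flip each bit of a {0,1} vector according to Equation (6).
    bits = np.asarray(bits)
    p = 2.0 / (1.0 + np.exp(epsilon / 2.0))
    prob_one = np.where(bits == 1, 1.0 - p / 2.0, p / 2.0)  # Pr[y = 1] for each position
    return (rng.random(bits.shape) < prob_one).astype(int)

noisy_bits = rappor_randomized_response([1, 0, 1, 1, 0], epsilon=1.0)
```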


As for Metric DP, a primary distance metric is Euclidean distance. It is an intuitive idea
to extend basic univariate Laplace distributions to bivariate Laplace distributions to sample
noises for achieving 2-dimensional Euclidean-distance-based Metric DP obfuscation (a.k.a. geo-
indistinguishability [3]). The Planar Laplace mechanism [3] employs this idea and adopts
2-dimensional Laplace distribution in the Cartesian coordinate system to generate noises. Specif-
ically, the Planar Laplace mechanism produces 2-dimensional Laplace noise centered at the real
2-dimensional data x with the probability density function

M_Planar(x)(y) := (ϵ^2 / 2π) · exp(−ϵ·d_{L2}(x, y)),    (7)

which can be transformed to the polar-coordinate probability function of Equation (8) centered at x:

M_Planar(r, θ) = (ϵ^2·r / 2π) · exp(−ϵr).    (8)

It is a practical method to generate a radius r following the marginal probability function
M_Planar(r) = ϵ^2·r·exp(−ϵr), then sample a random angle θ uniformly from (0, 2π], to obtain a 2-dimensional
noise vector ẋ = (r cos θ, r sin θ) and a randomized output y = x + ẋ.
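A sketch of this polar sampling procedure is shown below; instead of numerical root finding, it draws the radius from NumPy's Gamma sampler, since ϵ²r·exp(−ϵr) is the Gamma density with shape 2 and scale 1/ϵ. The coordinates in the usage line are arbitrary placeholders.

```python
import numpy as np

def planar_laplace_noise(x, epsilon, rng=np.random.default_rng()):
    # Radius follows eps^2 * r * exp(-eps * r), i.e., a Gamma distribution with shape 2 and scale 1/eps.
    r = rng.gamma(shape=2.0, scale=1.0 / epsilon)
    theta = rng.uniform(0.0, 2.0 * np.pi)          # angle drawn uniformly from (0, 2*pi]
    return np.asarray(x, dtype=float) + np.array([r * np.cos(theta), r * np.sin(theta)])

obfuscated_location = planar_laplace_noise([48.2082, 16.3738], epsilon=2.0)
```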
In the generalized n-dimensional Euclidean space, the radius value r (i.e., the magnitude of
vector) can be sampled from the Gamma distribution defined in Equation (9), while the angle of
vector Θ can be produced from a uniform distribution (Equation (10)) in the unit hypersphere in
Rn−1 [27, 31].
Definition 2.11 (Gamma Distribution). Gamma distribution takes in the shape n and the scale ϵ
and outputs r with the probability function:
Gamma_{ϵ,n}(r) := r^{n−1}·e^{−ϵr} / (ϵ^{−n}·(n − 1)!),    (9)
where it is feasible to obtain r by using numerical root finding methods including the Newton-
Raphson and secant method in implementation.
Definition 2.12 (Uniform Distribution). Uniform distribution samples every element in
{θ_1, · · · , θ_{n−1}} of Θ following the rule of

U(Θ) := Γ(n/2 + 1) / (n·π^{n/2})   (Θ ∈ H^n),    (10)

where U(Θ) is the uniform density over the surface of the unit hypersphere, Γ(n/2) := ∫_0^∞ x^{n/2−1} e^{−x} dx
is the Gamma function, and H^n := {Θ ∈ R^{n−1} : ‖Θ‖ = 1}. When the length (n − 1) of the vector Θ is
large, an alternative is to sample Θ from a multivariate normal distribution [34]:

N(Θ) := (1 / ((2π)^{n/2} |Σ|^{1/2})) · exp(−(1/2)·Θ^T Σ^{−1} Θ),    (11)

where Σ denotes the n × n variance-covariance matrix of the variable Θ.
Now it is sufficient to generate an n-dimensional noise vector ẋ:

ẋ[1] = r cos θ_1,    (12)
ẋ[2] = r sin θ_1 cos θ_2,    (13)
· · ·    (14)
ẋ[n] = r sin θ_1 sin θ_2 · · · sin θ_{n−2} sin θ_{n−1},    (15)

where further explanation is detailed in [31].
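A compact sketch of this n-dimensional sampling follows; rather than spelling out the hyperspherical angles of Equations (12)-(15), it normalizes a multivariate normal sample to obtain a uniform direction, the alternative mentioned above for long vectors [34]. The dimension and ϵ are placeholders.

```python
import numpy as np

def metric_dp_noise(n, epsilon, rng=np.random.default_rng()):
    # Magnitude from the Gamma distribution of Equation (9); direction uniform on the unit
    # hypersphere, obtained by normalizing a standard normal sample.
    r = rng.gamma(shape=n, scale=1.0 / epsilon)
    direction = rng.normal(size=n)
    return r * direction / np.linalg.norm(direction)

noisy_vector = np.zeros(300) + metric_dp_noise(n=300, epsilon=10.0)
```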


Taking a further step, for arbitrary distance metrics, the Exponential mechanism for numerical data
can be generalized into a Metric DP mechanism [12] (different from the Exponential mechanism for
categorical data used in DP-protected database queries). The Exponential mechanism M_{d_X}(·)(·)
returns an obfuscated real-valued vector y with probability inversely proportional to its (arbitrary)
distance from the real vector x, following

M_{d_X}(x)(y) := c_x · exp(−ϵ·d_X(x, y)),
c_x := ( Σ_{y'} exp(−ϵ·d_X(x, y')) )^{−1},    (16)

where c_x is a normalization constant, and a smaller distance indicates a higher probability.

3 DIFFERENTIAL PRIVACY METHODS FOR UNSTRUCTURED DATA CONTENT


We consider the problem that a data owner would like to share unstructured data mainly including
images, audios, videos, and texts with untrusted third parties in social media, smart devices, and
surveillance devices. Concerning the issue that these unstructured data may contain privacy-
sensitive information, such as faces, identities, license plates, voiceprints, authorships, and various
other personally identifiable information (PII), the data owner must obfuscate them and
publish their coarse-grained versions, in order to protect these PII from unintentional release
and illegal use. A rich set of obfuscation methods have been proposed to protect unstructured
data [38, 74, 86, 96]. Unfortunately, the majority of them suffer from a variety of deficiencies, such
as heavy computational overhead, inner attack, and nonprovable privacy. The differential privacy
model has emerged to avoid these deficiencies. We summarize the adaptive DP deployment in the
content of unstructured data in the data owner’s local setting, as shown in Table 2. In general, un-
structured data is transformed to vector representations through methods including pixelization,
SVD, bit vectorization, and word embedding, in which human perception of unstructured data
content should be retained as much as possible. After this precomputation process, vectors are
obfuscated in appropriate privacy models, followed by the obfuscated vectors’ projection back to
the unstructured data domain. The general obfuscation process is illustrated in Figure 1.

3.1 DP Methods for Image Content


Image data, especially faces and texts, can reveal personal sensitive information. When shared with
untrusted third parties, its privacy issues must be taken into account. It is a solution to sanitize
private regions before sharing images. Pixelization, blacking, and blurring are the most popular
technologies in obfuscating regions of interest (ROIs) in images. For instance, the box blurring
algorithm was introduced in the privacy-protecting Google street view to protect human faces
and vehicle license plates in [38]. However, several recent studies pointed out that these tradi-
tional methods are not sufficient to guarantee image privacy due to the fact that machine learning
models can be adapted to re-identify faces and texts [47]. To further enhance image privacy, some
more sophisticated models have been developed. In [94], the GAN-based method was designed to
inpaint head regions while protecting their naturalness. A recent work [107] examined adversarial
examples to protect compressed images. Unfortunately, this series of methods have not yet pro-
vided convincingly formal privacy. Differential privacy has emerged as a state-of-the-art technol-
ogy to address this problem. We therefore focus on summarizing its adaptive adoptions in defining
image data privacy. Since we target applying DP directly to image content on users’ sides, we will
not include the work [93, 110, 111], in which differential privacy is guaranteed in the Machine
Learning training process.


Table 2. Summary of DP Methods for Unstructured Data Content

Ref. | Private Data | Vectorization | Privacy Model | Privacy Mechanism | Challenges
[26] | Human Face | Pixelization | Pixel DP for Aggregated Pixels | Laplace | Low utility
[59] | Human Face | GAN Latent Coding | Latent Vector DP | Laplace | Adaptive privacy budgets w.r.t. components' prominences
[27] | Human Face | SVD | Euclidean Privacy for Individual Images | Multivariant Laplace | More-utility-retained human visual perception models
[13] | Human Face | Latent Code Representation | Euclidean Privacy for Individual Images | Multivariant Laplace | Distinct privacy requirements of different face sections
[44] | Voiceprint | x-vector | Voice Indistinguishability for Aggregated Voiceprints | Exponential Mechanism | Pre-computation time contradicts DP effectiveness
[104] | Persons in Videos | Pixelization | Video DP for Aggregated Visual Elements | Pixel Sampling, Laplace, Pixel Interpolation | Limited query types
[103] | Persons & Their Trajectories | Bit Vector | Object Indistinguishability for Individual Objects | RAPPOR-based Randomized Response | Object-level indistinguishability extended to video-level indistinguishability
[109] | Sensitive Text | GloVe | Hamming Privacy for Aggregated Words | Exponential Mechanism | High precomputation overhead
[31] | Sensitive Text | word2vec | Earth Mover's Privacy for Individual Bag-of-Words | Multivariate Laplace | Semantically meaningless in the high privacy regime
[34] | Sensitive Text | GloVe, FastText | Euclidean Privacy for Individual Words | Multivariate Normal, Gamma Distribution | Semantically meaningless in the high privacy regime
[114] | Sensitive Text | GloVe, FastText | Euclidean Privacy for Individual Words | Truncated Gumbel Mechanism, Poisson Sampling | Privacy level quantization w.r.t. neighborhood density
[120] | Sensitive Text | Pretrained Word Embedding, BERT | Euclidean Privacy and LDP for Sensitive Tokens | Exponential Mechanism, Mangat's Randomized Response | Definition of sensitivity levels
[35] | Sensitive Text | Hyperbolic Embedding | Hyperbolic Privacy for Individual Words | Hyperbolic Distribution | Homonymic words
[115] | Sensitive Text | FastText, GloVe | Mahalanobis Privacy for Individual Words | Multivariate Normal, Gamma Distribution | N/A
[116] | Sensitive Text | GloVe, FastText | Euclidean Privacy | Laplacian Variants, Vickrey Mechanism | N/A
[36] | Sensitive Text | GloVe, FastText | Any Distance Metric-based Privacy | Gaussian Random Matrix, Gamma Distribution | Improved random projections

3.1.1 Pixel DP (Based on DP). The work [26] initially adapts the differential privacy model
to the image domain, in which the notion of adjacent databases is developed to that of multi-
pixel neighboring images. Further, the definition can be explained in that the presence or absence
of several pixels in the image has little impact on private output. In this method, the image is
pixelized as grid cells before being applied with Laplace perturbation. The noise scale is controlled
by sensitivity (i.e., Definition 3.1) calculated from the grid cell size and the number of differing
pixels, as well as the privacy budget. Taking the perturbed result, adversaries cannot re-identify
the original pixelization result effectively and cannot further launch attacks to recover the original
image with it. However, the experimental results indicate that even in the low-privacy regime the
image shape is completely lost to the naked eye under this privacy model, so this pioneering attempt
provides an excessively strong privacy guarantee.
Definition 3.1 (Sensitivity of Image Data).

Δ := max_{I_i, I_j} ‖x_i(I_i) − x_j(I_j)‖_1 = 255p / b^2,    (17)

where I_i and I_j are neighboring images, and x(·) are their vector representations. The difference
between two images is at most 255 per pixel over p differing pixels, and images are pixelized with
grid cells of b × b pixels.
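A minimal sketch of this pixelize-then-perturb pipeline is given below, assuming a grayscale image stored as a 2-D NumPy array; the cell size b, the differing-pixel bound p, and ϵ are placeholders, and the code only illustrates the idea rather than the authors' implementation.

```python
import numpy as np

def dp_pixelize(image, b, p, epsilon, rng=np.random.default_rng()):
    # Pixelize with b x b grid cells, then add Laplace noise of scale 255*p / (b*b*epsilon),
    # following the sensitivity in Equation (17).
    h, w = image.shape
    out = np.empty((h, w), dtype=float)
    scale = 255.0 * p / (b * b * epsilon)
    for i in range(0, h, b):
        for j in range(0, w, b):
            cell = image[i:i + b, j:j + b]
            out[i:i + b, j:j + b] = cell.mean() + rng.laplace(0.0, scale)
    return np.clip(out, 0, 255)
```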
3.1.2 Latent Vector DP (Based on DP). To preserve the perceptual quality of noisy images, Li
and Clifton [59] proposed to provide DP guarantees for image latent vector representation. In
it, privacy budgets are allocated to various elements in the latent space in accordance with their
weights, while sensitivity is determined with the inspiration of observed sensitivity from [14]
that clips sensitivity into the maximum observed bounds. To this end, proper calibrated Laplace


noise is derived and added to images in the semantic space before GAN is used to synthesize
realistic-looking faces. Future work might improve utility with the privacy budget allocation in
a non-uniform manner; i.e., privacy budgets can be adapted and varied as per the prominence of
protected components. For example, age and gender features are given smaller ones than emotional
states.
3.1.3 Euclidean Privacy (Based on Metric DP). A follow-up paper [27] precomputes ROIs, such
as faces in image data, with the SVD transformation to extract the key values in image repre-
sentation. It proposes to broaden the 2-dimensional Euclidean-distance-based metric privacy (i.e.,
geo-indistinguishability) in the geo-space to the k-dimensional privacy model. With it, similar in-
puts of singular values (quantified with Definition 3.2) cannot be probabilistically distinguished by
observing the output. The multi-variant Laplace mechanism is employed to achieve this privacy
model in design and used to produce the noise vector in implementation. Obfuscated key values
are therefore obtained with the noise vector, followed by being fed to reconstruct the obfuscated
image thanks to the invertibility property of SVD. The obfuscation process is similar for other
invertible transformations, such as the Discrete Cosine Transform. It is a challenge to take a
further step to preprocess original images in a more advanced human visual perception model in
order to boost image data utility.
Definition 3.2 (Euclidean Distance for Image Data).
d(I_i, I_j) := d_X(x_i, x_j) = ‖x_i − x_j‖,    (18)

where x are vectors made up of SVD singular values of the image matrix (i.e., multi-dimensional
vectors over the field of real numbers).
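An illustrative sketch of this SVD-based pipeline follows (a simplification under stated assumptions, not the exact calibration of [27]); k and ϵ are placeholders.

```python
import numpy as np

def svd_metric_dp(roi, k, epsilon, rng=np.random.default_rng()):
    # Keep the k largest singular values, perturb them with a Euclidean metric-DP noise vector
    # (Gamma magnitude, uniform direction, as in Section 2.2), and reconstruct the region.
    U, s, Vt = np.linalg.svd(roi.astype(float), full_matrices=False)
    r = rng.gamma(shape=k, scale=1.0 / epsilon)
    direction = rng.normal(size=k)
    direction /= np.linalg.norm(direction)
    s_noisy = np.maximum(s[:k] + r * direction, 0.0)      # keep singular values non-negative
    return np.clip(U[:, :k] @ np.diag(s_noisy) @ Vt[:k, :], 0, 255)
```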
In a recent study [13], the authors targeted generating privatized realistic-looking faces. They
use GANs to learn image semantics such as smiling and young attributes to obtain latent codes for
facial images. The perception indistinguishability notion is then defined for latent codes’ represen-
tation of images, while perception distance between latent codes is defined as semantic dissimilar-
ity between images. One potential improvement of privacy-utility reconciliation is that different
parts of human faces are in demand of distinct privacy levels; e.g., the eyes section, especially the
iris and retina, need higher privacy protection than the shape of faces since the iris biometric plays
a more significant role in authentication applications.

3.2 DP Methods for Audio Content


Concerns about speech data security and privacy arise with the development of smart devices,
such as Apple’s HomePod and Amazon Echo. On one hand, personal voiceprints, as a kind of
unique biometric identifier contained in speech data, similar to fingerprints, are emphasized as
strong privacy indicators in the act GDPR [70]. On the other hand, voiceprints play key roles
in voice authentication systems, and once released or stolen, many serious privacy risks will be
caused. To tackle these privacy concerns, some solutions have been proposed. Anonymization
is the most common method. In [51], voiceprints are removed by substituting speech data with
texts, after which texts are synthesized with a specific voiceprint to generate de-identified voices.
Similarly, another work [30] investigates replacing speaker identities with anonymous pseudo-
identities by extending k-anonymity technology and then feeding them into neural waveform
models to synthesize anonymized speech. It is another good solution to obfuscate speech data
on vocal tract length normalization (VTLN) space [76], with which key parameters in voice
conversations can be hidden. While the voicemask intermediary is introduced to modify speakers’
voiceprints between speaker and speech data collector in [77], the selection of key parameters is
investigated in [90]. These methods indeed can provide practical privacy protection for voiceprint


data but cannot clarify how much privacy is protected against various attacks. Differential privacy,
which we are more concerned with afterward, can achieve this quantifiable privacy goal.
3.2.1 Voice Indistinguishability (Based on Metric DP). The work [44] formally defines the voice
indistinguishability model (i.e., Definition 3.3) based on Metric DP to guarantee provable privacy
protection for our voiceprints. In it, x-vector [88] is employed to represent voiceprint, and angular
distance (a specialization of cosine distance fulfilling triangle inequality), defined in Equation (20),
is adopted to measure the similarity of voiceprint pairs. Voice indistinguishability can therefore
guarantee that any adversary cannot guess whether the randomized output voiceprint is from its
real owner Alice or another person Bob. Alice’s voiceprint is randomly substituted by others’ (e.g.,
Bob’s) with corresponding probabilities inversely proportional to the angular distance between
their voiceprint vectors. The obfuscated voiceprint is then synthesized with the speech content
to obtain the obfuscated audio data. Although voice indistinguishability works well in providing
quantifiable privacy guarantees for audio data, it needs to pre-compute numerous distances for
voiceprint pairs, which would cost time, especially when the voiceprint database is overly large.
Note the fact that differential privacy works better in larger databases. The pre-computation time
and differential privacy effectiveness therefore contradict each other. It is another challenge to
extend the metric privacy concept to the speech level.
Definition 3.3 (Voice Indistinguishability). A mechanism M satisfies ϵ-voice indistinguishability
iff for any input voiceprint pair x_i, x_j ∈ R^n (i, j ∈ {1, 2, . . . , #voiceprints}, n is the length of the
vectors representing voiceprints), and for a randomized vector y:

Pr[M(x_i) = y] ≤ e^(ϵ·d_X(x_i, x_j)) · Pr[M(x_j) = y].    (19)
Definition 3.4 (Angular Distance).

d_X(x_i, x_j) := arccos(cos_similarity(x_i, x_j)) / π,    (20)

cos_similarity(x_i, x_j) := (x_i · x_j) / (‖x_i‖ · ‖x_j‖).
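A sketch of this angular-distance-based substitution is given below; the pool of candidate x-vectors and the ϵ value are placeholders, and the exact probability calibration in [44] may differ.

```python
import numpy as np

def angular_distance(x, y):
    # Angular distance of Equation (20).
    cos_sim = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    return np.arccos(np.clip(cos_sim, -1.0, 1.0)) / np.pi

def obfuscate_voiceprint(x, candidate_pool, epsilon, rng=np.random.default_rng()):
    # Exponential-mechanism-style selection: closer candidate x-vectors are chosen more often.
    dists = np.array([angular_distance(x, c) for c in candidate_pool])
    weights = np.exp(-epsilon * dists)
    return candidate_pool[rng.choice(len(candidate_pool), p=weights / weights.sum())]
```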

3.3 DP Methods for Video Content


Millions of videos are collected through video surveillance devices and other video recording fa-
cilities. Privacy issues within them have received increasing attention recently, since they contain
considerable sensitive information, including human bodies, faces, and activities. Many privacy-
protecting technologies have been applied to video data. IBM pioneered the systematic study of
privacy-preserving video surveillance and developed the PeopleVision system [85]. After-
ward, sanitization schemes, including blurring and pixelization [47], as those are applicable for
ROIs in individual images, have been proposed. With privacy requirements being more sophisti-
cated, some other sensitive information, such as activities and places in videos, was also investi-
gated with privacy methods [82]. However, the key challenge is that this series of schemes cannot
ensure certain levels of privacy against adversaries with unknown prior knowledge. We abstract
several state-of-the-art privacy protection technologies with DP guarantees.
3.3.1 Video DP (Based on DP). The videoDP model [104] defined in Definition 3.5 customizes
DP to the video privacy domain, in which video data is represented as pixels with content of
3-dimensional vectors in the RGB color model and neighboring inputs are defined as pairs of videos
dissimilar in one visual element (e.g., a human) across full videos. Specifically, this adaptive DP
notion guarantees that the responses to video analytical queries barely change whether a certain
human or object is absent from or present in a video. Pixels are randomly sampled in a differentially private manner


before being put into distinct RGB categories. After this procedure, pixel counts in most frequently
occurring RGBs are separately obfuscated with an appropriate Laplacian noise scale. Other pixels
unselected are interpolated as the DP post-processing operation to reconstruct the visual elements.
The main challenge is to examine more statistical query types when calibrating the noise scale,
especially in this case of non-interactive DP model.
Definition 3.5 (ϵ-Video DP). A randomized mechanism satisfies ϵ-Video DP iff for all pairs of
videos V and V′ differing in one visual element throughout the entire frames of the video, and for
any output statistical query response y:

Pr[M(V) = y] ≤ e^ϵ · Pr[M(V′) = y].    (21)
3.3.2 Object Indistinguishability (Based on LDP). The notion of ϵ-object indistinguishabil-
ity [103] (i.e., Definition 3.6) is defined to protect the privacy of humans and other sensitive objects
as well as their moving trajectories from frame to frame in the video. As an instantiation of LDP
in the video scenario, the proposed definition focuses on providing indistinguishability for indi-
vidual objects in the obfuscated output video. Specifically, the event of an object being in or out
of a frame is denoted with 0 or 1, and an object's moving activity in a video (i.e.,
a sequence of frames) is therefore represented by a bit-stream vector. The definition guaran-
tees that an adversary with arbitrary prior information cannot figure out whether the released
bit vector is generated from the object Alice or Bob. Consequently, the privacy of individuals and
their trajectories are protected. As for obfuscated methods, the RAPPOR-based randomized re-
sponse is adopted as a basic primitive mechanism. The privacy budgets are allocated to the key
frames in the video rather than all the frames in order to boost the utility of the randomized video.
After the obfuscation step, the obfuscated objects and background scenes are integrated to recon-
struct the complete obfuscated video. This work first models the video events with bit vectors and
then formalizes object privacy in the video. As it is mentioned in the prior basic knowledge section,
it is a worrying problem that accuracy error is high for small user populations in vanilla local DP.
It is also a challenge to improve statistical utility in videos with sparse objects.
Definition 3.6 (ϵ-object Indistinguishability). A probabilistic mechanism M satisfies ϵ-object
indistinguishability iff for any two input objects x_i, x_j ∈ {0, 1}^n (i, j ∈ {1, 2, . . . , #objects}, n is the
number of all frames in the video) in the original video V, for any output object y (y ∈ range(M))
in the obfuscated video V′, and for all adversaries:

| ln( Pr[M(x_i) = y] / Pr[M(x_j) = y] ) | ≤ ϵ.    (22)
However, recent research [10] shows that it is unrealistic or overly demanding to detect all private infor-
mation and then protect it in each frame of videos in the aforementioned approaches. To implement
private video analysis practically, the authors defined a new notion of event-duration privacy to
provide privacy for any captured rather than detected things during a pre-determined time period,
freeing from the requirements of perfect detection results.

3.4 DP Methods for Text Content


Massive amounts of textual data are frequently shared digitally among different entities currently.
In many cases, it is desirable that the authorship of them remains anonymous, since they could
sometimes disclose sensitive details about their author. This is especially true for texts involving
health conditions, locations, and various personal lifestyle preferences. A rich set of work has
been proposed to tackle these privacy concerns, including generalization [2], perturbation [81],
and sanitization [83]. However, they are not shown to be mathematically well-founded privacy


models and robust against adversarial attacks. It is a challenge to formally define and guarantee
a certain level of the authorship privacy of textual data while retaining their content substance
as much as possible in a rigorous mathematical manner. The definition of differential privacy and
its extension d X metric privacy can serve as primitive concepts to address this challenge. Sev-
eral works have made some contributions in this exploration by adaptively tailoring distance met-
rics. Within them, word embedding models including GloVe [73], FastText [7], word2vec [68], and
Hyperbolic embedding [39] are selected to transform textual data into high-dimensional vectors;
they are then perturbed with noise vectors sampled from probabilistic distributions. In general,
the noisy real-numbered vectors are finally projected back to their closest words in the embed-
ding model. In this step, discretization operation is adopted due to the discrete nature of textual
data, which also fulfills the quantified metric privacy goal due to the post-processing property
inherited from standard differential privacy [52]. We summarize these works based on different
distance metrics they select to stretch privacy level ϵ, which is further developed into respec-
tive indistinguishability metrics. For the sake of simplicity, we specify these works as Ham-
ming Privacy, Earth Mover’s Privacy, Euclidean Privacy, Hyperbolic Privacy, and Mahalanobis
Privacy.
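The embed-perturb-project pipeline shared by these works can be sketched as follows; the tiny embedding table and the ϵ value are hypothetical placeholders standing in for GloVe/FastText vectors, and the noise calibration follows the Euclidean mechanism of Section 2.2 rather than any single surveyed method.

```python
import numpy as np

def privatize_word(word, embeddings, epsilon, rng=np.random.default_rng()):
    # Perturb a word's embedding with Euclidean metric-DP noise and project the noisy
    # vector back to the nearest word in the vocabulary (discretization step).
    vocab = list(embeddings)
    vecs = np.array([embeddings[w] for w in vocab], dtype=float)
    x = np.array(embeddings[word], dtype=float)
    n = x.size
    r = rng.gamma(shape=n, scale=1.0 / epsilon)           # noise magnitude
    direction = rng.normal(size=n)
    direction /= np.linalg.norm(direction)                # uniform direction
    noisy = x + r * direction
    return vocab[int(np.argmin(np.linalg.norm(vecs - noisy, axis=1)))]

# Toy usage with a hypothetical 3-dimensional embedding table.
toy = {"library": [0.9, 0.1, 0.0], "bookstore": [0.8, 0.2, 0.1], "cafe": [0.1, 0.9, 0.2]}
print(privatize_word("library", toy, epsilon=5.0))
```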
3.4.1 Hamming Privacy (Based on DP). The work [109] takes the pioneering step to apply the
Hamming-distance-based privacy (i.e., traditional differential privacy) method in the text privacy
domain. It mainly investigates how to hide the term-frequency-related distributions in the bag-
of-words embeddings. For this purpose, the exponential mechanism is employed to randomize
discrete terms to sets of candidate terms with associated probabilities, in which the probabili-
ties are decided by the similarity metric between input and potential output terms. By trading
off the utility-privacy requirements, the similarity function (i.e., Definition 3.7) is created as the
composition of cosine similarity and bigram overlap variables. The advantage in this case is that
the general discretization step doesn’t need to be carried out due to the categorical nature of the
exponential mechanism output. However, it can be observed that there exist certain disadvan-
tages in this initial try of differential privacy implementation in the textual data. For example, it
is a bit time-consuming to precompute the occurring likelihood of vocabulary candidates term by
term.
Definition 3.7 (Score Function).
d X (xi , xj ) := cos(xi , xj ) − sB(xi , xj ), (23)
where cos(xi , xj ) represents the cosine similarity for GloVe [73] word vectors, and B(xi , xj ) rep-
resents the bigram overlap, which measures word correlations in contexts. s is the scaling factor
adapting to what extent the bigram overlap affects the score function.
3.4.2 Earth Mover’s Privacy (Based on Metric DP). The paper [31] targets exploiting the idea
of metric privacy to protect the author identity of the documents that are similar in topic. Specif-
ically, text documents are represented as bags-of-words, equivalently vectors made of discrete
words and their corresponding frequencies. Word embeddings are further adopted to represent
words inside. For quantified privacy guarantee, metric privacy (i.e., ϵd X -indistinguishability) is
employed based on the Euclidean distance of word embeddings. The smaller distance indicates the
higher indistinguishability of similar words. In practical implementations, words can be mapped
to more similar words with higher probability by using multi-dimensional mechanisms. With the
Euclidean-distance-based metric privacy for single words extended to the Earth mover’s distance
for bags-of-words, Earth mover’s distance-based metric privacy, therefore, is developed as an ap-
propriate privacy model for documents. Its privacy definition is formalized as follows:


Definition 3.8 (Earth Mover's Privacy). A mechanism satisfies ϵE_{d_X}-privacy (d_X is the Manhattan
metric on word pairs, E_{d_X} is the Earth Mover's metric on bag-of-words vector pairs) iff for any inputs
x_i, x_j ∈ R^n (i, j ∈ {1, 2, . . . , #bag-of-words}, n is the length of the bag-of-words vectors representing
documents) and output y:

M(x_i)(y) ≤ exp(ϵ·E_{d_X}(x_i, x_j)) · M(x_j)(y).    (24)
Definition 3.9 (Earth Mover's Metric).

E_{d_X}(x, y) := min_F Σ_{x_i ∈ x} Σ_{y_j ∈ y} d_X(x_i, y_j)·F_{ij}    (25)

s.t.  Σ_{i=1}^n F_{ij} = f_j / |y|  and  Σ_{j=1}^n F_{ij} = g_i / |x|,

where g_i is the occurring frequency of x_i, f_j is the occurring frequency of y_j, Σ_i g_i = |x|,
Σ_j f_j = |y|, and Σ_{i,j} F_{ij} = 1.
3.4.3 Euclidean Privacy (Based on Metric DP). Another close line of work [34] focuses on ex-
ploring the privacy of a word string, in which the d X privacy model is achieved by multivariate
normal distribution and gamma distribution. Different from the Earth mover’s distance in [31],
this work defines the similarity of word string pairs as the accumulated sum of individual word
pairs’ Euclidean distance (i.e., Definition 3.10) by regarding each word individually. For the further
detailed implementations, the magnitude of the noise vector is sampled from gamma distribution,
while the weight of each element in the vector is sampled with multivariate normal distribution.
While the standard Euclidean distance can indicate semantic similarity between word strings, it
has the problem that randomized word strings may deviate too far away from the original ones in
the high-privacy regime, thus leaving little utility to the analysis tasks. A follow-up work [120] di-
vides tokens including characters and wordpieces into sensitive and non-sensitive types and then
only sanitizes sensitive ones to improve overall utility. It is rather remarkable that the definition of
sensitive tokens varies across individuals. A personalized mechanism can be introduced to tag
sensitive tokens on the user side. The definition of Utility Optimized Metric LDP (i.e., Definition 3.11)
is further proposed for tokens’ privacy by considering both the sensitivity and similarity levels
between input pairs. Mangat’s randomized response [62] and exponential mechanism are used to
carry out token sanitization. A concurrent work [114] considers that densely located words have
high probability to be substituted with close irrelevant words in Euclidean space. A truncated Gum-
bel mechanism is further proposed to derive only relevant nearby replacements. Specifically, a
Poisson sampling modification is adopted to obtain the top-k closest candidate set around the current
word before truncated Gumbel noise is injected into the inter-word distances. After that, the nearest
substitution is selected. An intriguing point here is to calibrate noise scale (i.e., privacy level pa-
rameter ϵ value) as a function of neighborhood density at original words.
Definition 3.10 (Euclidean Distance).

d_X(x, y) := Σ_{i=1}^n d(x_i, y_i) = Σ_{i=1}^n ‖x_i − y_i‖,    (26)

where x i denotes the ith word vector of a word string x, and d (x i , yi ) denotes the Euclidean distance
between the real word and its obfuscated version on R.


Definition 3.11 (Utility-optimized Metric LDP). Given x[S] ⊆ x and x[U ] ⊆ x, where x[S] and
x[U ] denote sensitive and non-sensitive tokens, y[P] ⊆ y denotes protected tokens. Let ϵu and ϵm
be privacy parameters for utility-optimized LDP and metric DP, respectively. Given a proximity
measurement d_X : X × X → R+, M satisfies (x[S], y[P], ϵ_u, ϵ_m)-Utility-Optimized Metric LDP, if
(1) for any yi ∈ y[U ], where y[P] ∩ y[U ] = ∅, there exists an x i ∈ x[S] such that
Pr[M (x i ) = yi ] > 0, Pr[M (x j ) = yi ] = 0 ∀x j ∈ x \ x i ; (27)
(2) for any x i , x j ∈ x and any y j ∈ y[P],
   
Pr[M(x_i) = y_j] ≤ exp(ϵ_m·d_X(x_i, x_j) + ϵ_u) · Pr[M(x_j) = y_j].    (28)

3.4.4 Hyperbolic Privacy (Based on Metric DP). The paper [35] proposes to represent slot values
with a more advanced word embedding model, specified as Hyperbolic embedding, since Hyper-
bolic distance (an instantiation is Definition 3.12) within this embedding model is demonstrated
to be better in capturing hierarchical relationships and semantic similarity between word pairs,
in comparison with Euclidean distance. Conforming to the design criteria in the traditional d X
metric privacy, corresponding probabilistic distribution is constructed. In it noise vectors are sam-
pled with their likelihoods inversely proportional to their distance with the actual vectors in the
Hyperbolic space. By taking this well-attuned obfuscation method, the slot values are therefore
generalized to broader words in semantic meaning, instead of exact ones. The exact owner of the
textual data can be protected from inference attacks, while his or her initial intent in the data can
be obtained. The reason this privacy model outperforms Euclidean models above is that a specific
word is substituted by its associated general term, instead of an arbitrary one in Euclidean space.
However, this privacy model does not take homonyms into account. That is, same-spelled words
may indicate different meanings; it may generalize “Dove” to personal care products or leisure
food.
Definition 3.12 (Poincaré Distance).
d_X(x_i, x_j) := arcosh(1 + δ(x_i, x_j)),    (29)

δ(x_i, x_j) := 2·‖x_i − x_j‖^2 / ((1 − ‖x_i‖^2)·(1 − ‖x_j‖^2)),

where arcosh is the inverse hyperbolic cosine function, and x_i and x_j are word vectors.
3.4.5 Mahalanobis Privacy (Based on Metric DP). A recent paper [115] points out that texts
in sparse regions would probably remain unchanged after perturbation with spherical-like noise
in [31, 34]. Authors propose perturbation methods based on elliptical noise to address this problem.
The Mahalanobis-distance-based privacy definition is developed to provide sufficient privacy guar-
antees to words in the sparse regions. It guarantees that any two words in the word embedding
space cannot be distinguished and the indistinguishability level is proportional to the Mahalanobis
distance between them. The Mahalanobis mechanism is also designed to achieve the definition.
Within it, multivariate normal distribution is used to generate a randomized unit vector, while
gamma distribution is used to sample a randomized magnitude for the unit vector. Particularly
crucial is the composition of the noise. In it, the positive definite matrix Σ and the tuning parame-
ter α in Mahalanobis distance, as shown in Definition 3.13, stretch common spherical-like noise to
elliptical noise; more precisely, Σ controls the direction of elliptical contour, while α controls the
scale of elliptical contour. By transforming spherical noise in [31, 34] to elliptical noise in this work,
the Mahalanobis privacy model improves the probability of substituting a word in sparse regions
with its nearby one, and therefore boosts the privacy guarantees for words in sparse regions.


Definition 3.13 (Mahalanobis Distance and Its Regularized Version).

d_X(x_i, x_j) := sqrt( (x_i − x_j)^T Σ^{−1} (x_i − x_j) ),    (30)

d_X^r(x_i, x_j) := sqrt( (x_i − x_j)^T {αΣ + (1 − α)I_n}^{−1} (x_i − x_j) ),    (31)
where α ∈ [0, 1] and Σ is a positive definite matrix. When α = 1, the regularized Mahalanobis
distance narrows to Mahalanobis distance; when α = 0, the regularized Mahalanobis distance
narrows to Euclidean distance.
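A loose sketch of elliptical noise generation in this spirit is given below; it is not the exact Mahalanobis mechanism of [115], and Σ, α, and ϵ are placeholders chosen only for illustration.

```python
import numpy as np

def mahalanobis_noise(n, epsilon, Sigma, alpha, rng=np.random.default_rng()):
    # Elliptical noise: the direction is drawn from a multivariate normal with the regularized
    # covariance alpha*Sigma + (1 - alpha)*I and normalized, then scaled by a Gamma magnitude.
    cov = alpha * Sigma + (1.0 - alpha) * np.eye(n)
    direction = rng.multivariate_normal(mean=np.zeros(n), cov=cov)
    direction /= np.linalg.norm(direction)
    return rng.gamma(shape=n, scale=1.0 / epsilon) * direction

# Sigma could, for instance, be estimated as the covariance of the word-embedding matrix.
noise = mahalanobis_noise(n=50, epsilon=10.0, Sigma=np.eye(50), alpha=0.5)
```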
A recent work [116] also focuses on low privacy guarantees for rare words such as street names
in sparsely populated regions of embedding space. To avoid the event that perturbed vectors are
discretized to original words instead of distinct nearby ones, the authors introduce Vickrey mech-
anisms originating from Vickrey auction [99] in the year 1961. With these mechanisms, noisy vec-
tors have high probability to be remapped back to their second nearest words, and therefore sig-
nificantly reduced probability to find their first nearest neighbors as original words. In particular,
Vickrey mechanisms are effective privacy boosters after any distance-metric-based mechanisms
are applied to word points.
Additionally, another study [36] proposes a general method to improve utility in differentially
private vector embedding. In it, high-dimensional vectors are projected to low-dimensional space
with a standard Gaussian random matrix approach before being perturbed with appropriately cali-
brated noise. An interesting exploration is to adopt state-of-the-art random projections such as [91]
to decrease utility error in dimensionality reduction.

4 UTILITY AND PRIVACY ANALYSIS IN DP-PROTECTED UNSTRUCTURED DATA CONTENT
DP and its variants have provided rigorous privacy guarantees for unstructured data against ad-
versaries with arbitrary background information. While privacy loss has been quantified with the
ϵ value, the utility loss is measured between the real data and its obfuscated version in the experi-
mental evaluations. Different utility loss metrics are adopted to quantify utility losses in different
data types. Besides, it has been shown that other privacy-protecting methods for unstructured
data are commonly vulnerable to machine-learning-model-based attacks. It is crucial to identify
whether DP methods can do away with these AI attacks. This section mainly summarizes utility
and privacy analysis in DP-protected unstructured data content. The key focuses are shown in
Table 3. Also, the pros and cons of these privacy methods are abstracted in Table 4.

4.1 Utility and Privacy Analysis in DP-protected Image


Compared with the original image content, the obfuscated version loses utility mainly through vectorization and DP noise injection. The utility loss caused by vectorization is the main focus here. It is known that an image is conventionally represented as an n × m matrix whose elements are pixel values. For the Pix DP method [26], pixelization represents images with average pixel values over larger blocks, equivalently matrices with fewer elements. The larger the blocks, the lower the visual quality humans can perceive. Regarding the Euclidean Privacy method [27], the SVD technique extracts the key singular values from the matrix to exhibit image geometrical structures and characteristics. Truncating the full vectors and keeping only some key singular values loses utility; using more singular values reconstructs images with closer visual perception. In [59] and [13], GANs are used to learn image latent representations in feature space.

Table 3. Utility Loss and Re-identification Attacks in DP-protected Unstructured Data Content

Ref. | Utility Loss Source (besides DP Noise) | Utility Metric | AI Attack in Privacy Evaluation
Pix DP [26] | Pixelization | SSIM, MSE | Label Classification against CNN Attacks
SVD-Privacy [27] | SVD | SSIM, MSE | Label Classification against CNN Attacks
Latent Vector DP [59] | Latent Representations | PSNR, SSIM, FID | N/A
Perception Indistinguishability [13] | Latent Representations | SSIM, FID | N/A
Voice Indistinguishability [44] | x-vector | MSE, PLDA (ACC) Metric, Dissimilarity and Naturalness in Humans' Perception | Label Classification against RNN Attacks
Video DP [104] | N/A | KL Divergence, MSE, Detection Precision and Recall | Object Identification against CNN Attacks
Object Indistinguishability [103] | N/A | Trajectory Deviation, Object Counts | N/A
Hamming Privacy [109] | Word Embedding | F1 Score | JStylo Authorship Attribution Attack
Earth Mover's Privacy [31] | Word Embedding & Discretization | Counts of Correct Prediction | 1-NN, Koppel's Attribution Attack
Euclidean Privacy [34] | Word Embedding & Discretization | Classification Accuracy, Mean Average Precision, Mean Reciprocal Rank | Query Scrambling Attack
Euclidean Privacy [114] | Word Embedding & Discretization | Correct Predictions Percentile in Classification Task | Population-based Genetic Attack
Euclidean Privacy [120] | Word Embedding & Discretization | Classification Accuracy, PCCs of Semantic Similarity | Mask Token Inference Attack
Hyperbolic Privacy [35] | Word Embedding & Discretization | Accuracy Score in Classification Task, Natural Language Task | Koppel's Attribution Attack
Mahalanobis Privacy [115] | Word Embedding & Discretization | Accuracy, Precision and Recall in Text Classification | N/A
Euclidean Privacy [116] | Word Embedding & Discretization | Classification Error | Inference Error
Any Distance-Metric-based Privacy [36] | Word Embedding & Discretization & Dimensionality Reduction | Vector Differences between Embedding Vectors, Classification Accuracy on NLP Database | N/A

The accuracy of the latent vectors is determined by how closely the synthesized images match the real images. Overall, image quality is captured better in the high-level feature space in [13, 59] than in the mid-level geometry space in [27] and the low-level pixel space in [26].
Generally, two utility metrics are adopted to quantify the difference between the real image and its obfuscation: mean squared error (MSE) and the structural similarity index (SSIM) [106]. While MSE is the most common measure of data difference, SSIM, composed of luminance, contrast, and structure comparisons, evaluates image dissimilarity in a more sophisticated manner. Another utility metric, the Fréchet Inception Distance (FID), is additionally used to measure image quality in GANs. The Perception Indistinguishability method [13] achieves higher SSIM and lower FID than SVD-Privacy [27], followed by Pix DP [26]. This is because GANs capture more human perception information and image semantics than singular values and pixelization before DP obfuscation is applied. Other feature extraction methods, such as the discrete cosine transform (DCT) and discrete wavelet transform (DWT), can replace SVD to represent images and work like SVD-Privacy. DP methods defined on digital image signals such as pixels, singular values, and DCT coefficients generate obfuscated images in a straightforward manner but suffer from low-level perceptual quality. Although DP methods defined on neural networks synthesize images of high-level quality, they suffer from expensive and time-consuming hyper-parameter tuning.
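As an illustration of these metrics, the snippet below computes MSE and SSIM between an original image and its obfuscated counterpart using scikit-image; the images here are synthetic placeholders rather than DP-protected outputs.

```python
import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity

rng = np.random.default_rng(0)
original = rng.random((64, 64))                          # placeholder grayscale image in [0, 1]
obfuscated = np.clip(original + rng.normal(scale=0.1, size=original.shape), 0, 1)

mse = mean_squared_error(original, obfuscated)           # lower MSE implies higher utility
ssim = structural_similarity(original, obfuscated, data_range=1.0)  # closer to 1 is better
print(f"MSE = {mse:.4f}, SSIM = {ssim:.4f}")
```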
Since the popular obfuscation methods of mosaicing and blurring have been shown to lose functionality against convolutional neural network (CNN) attacks, with original images recovered at remarkably high accuracies of up to 96% [63], DP obfuscation is evaluated by benchmarking against them, assuming an adversary possesses DP-protected images and their labels in the training set.
Table 4. Pros and Cons of Privacy Methods

Privacy Method | Pros | Cons
Pix DP [26] | Straightforward obfuscation in digital image signals | Low-level perceptual information
SVD-Privacy [27] | Straightforward obfuscation in digital image signals | Low-level perceptual information
Latent Vector DP [59] | Realistic image | Expensive hyper-parameter tuning, complex sensitivity calculation
Perception Indistinguishability [13] | Realistic image | Expensive hyper-parameter tuning
Voice Indistinguishability [44] | Effective utility-privacy trade-off | High precomputation overhead in large database
Video DP [104] | High utility | Unpredicted vulnerability to AI attacks
Object Indistinguishability [103] | Robust to AI attacks | Low utility in sparse crowds
Hamming Privacy [109] | Effective utility-privacy tradeoff | Expensive sensitivity calculation
Earth Mover's Privacy [31] | Effective privacy-utility tradeoff | Low utility
Euclidean Privacy [34] | Effective privacy-utility tradeoff | Low utility
Euclidean Privacy [114] | Improved utility with reduced substitution candidates | N/A
Euclidean Privacy [120] | Improved utility with sensitive and nonsensitive word division | N/A
Hyperbolic Privacy [35] | Improved hierarchical structure of private words | N/A
Mahalanobis Privacy [115] | Improved privacy for sparse words | Unpredicted vulnerability to AI attacks
Euclidean Privacy [116] | Improved privacy with discretization to top-k nearest words | N/A
Any Distance Metric-based Privacy [36] | Improved utility with vector dimensionality reduction | Unpredicted vulnerability to AI attacks

It is shown that the adversary can only identify the labels of DP-protected images (protected at the same privacy level) in the testing set with CNNs at lower accuracies. As lower re-identification accuracy indicates a higher privacy level, the Pix DP model [26] provides stronger protection than SVD-Privacy [27]. It remains unknown whether AI attacks pose significant privacy risks to the GAN-based privacy methods in [13, 59].

4.2 Utility and Privacy Analysis in DP-protected Audio


In the Voice Indistinguishability scheme [44], the x-vector system [88] extracts features from voiceprints and represents them as 512-dimensional vectors using DNN embeddings. Developing more sophisticated speech embeddings to quantify voiceprints could further preserve utility. To evaluate utility, speaker verification accuracy (ACC) under a probabilistic linear discriminant analysis (PLDA) classifier [100] is adopted, together with MSE. MSE calculates the difference between the original voiceprints and their obfuscations in the vector space, and ACC under the PLDA metric measures how many obfuscated speeches, composed of obfuscated voiceprints and original contents, can still be verified. Lower MSE implies higher utility, as does higher ACC. Apart from these machine-computed utility metrics, speech dissimilarities between the original and obfuscated versions are also measured with human opinion scores according to human perception. Another interesting metric, naturalness, which can only be judged by humans, is also used to assess utility. As for re-identification of obfuscated speech, also known as speech recognition, three-layer bidirectional long short-term memory (BLSTM) neural networks [40] are trained with the connectionist temporal classification objective and character labels. Label classification results are compared in speech recognition with the character error rate (CER) metric. The overall low recognition performance in terms of CER suggests that Voice Indistinguishability is robust against RNN-based attacks even in the low-privacy regime.
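A minimal sketch of the vector-space part of this utility measurement is given below: MSE and cosine similarity between an original x-vector and its obfuscated version, with random 512-dimensional vectors standing in for real x-vectors and placeholder noise standing in for the actual obfuscation.

```python
import numpy as np

rng = np.random.default_rng(0)
x_vector = rng.normal(size=512)                              # stand-in for a speaker x-vector
obfuscated = x_vector + rng.laplace(scale=0.05, size=512)    # placeholder obfuscation noise

mse = np.mean((x_vector - obfuscated) ** 2)                  # lower MSE implies higher utility
cosine = np.dot(x_vector, obfuscated) / (np.linalg.norm(x_vector) * np.linalg.norm(obfuscated))
print(f"MSE = {mse:.4f}, cosine similarity = {cosine:.4f}")
```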

4.3 Utility and Privacy Analysis in DP-protected Video


In Video DP [104] and Object Indistinguishability [103], objects are either extracted with RGB color pixel models or represented by their presence (1)/absence (0) status in the video, so no extra utility is lost during the model representation process. Moreover, pixel sampling or frame sampling is employed before DP noise addition in order to retain more utility for key components. The optimal sample selection is constrained by minimizing the utility loss.
For the measurement of the utility loss in Video DP, two metrics including KL divergence and
MSE are taken to calculate the RGB count distribution dissimilarities and RGB value differences
between the original video and its perturbation, respectively. Precision and recall accuracies also
evaluate the proportions of original objects among all the detected objects as well as the pro-
portions of detected original objects among all original objects by using state-of-the-art contour
detection algorithms [118]. The precision results show steadily high accuracies, while recall ones
show increasing trends with the increase of ϵ values. Concerning some statistical queries, such
as objects’ density and stay time in certain places, the Video DP scheme is compared with the
conventional privacy integrated queries system [65] and demonstrates better utility. Adversaries equipped with DNN-based deep learning methods prove relatively weak against the Video DP obfuscation, achieving a success rate of only approximately 20%, in comparison with the mosaicing and blurring methods.
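The KL-divergence utility measure can be sketched as follows: RGB value histograms of the original and perturbed frames are normalized into distributions and compared with scipy; the frames here are random placeholders and the bin count is an arbitrary choice.

```python
import numpy as np
from scipy.stats import entropy

rng = np.random.default_rng(0)
original = rng.integers(0, 256, size=(120, 160, 3))               # placeholder video frame
perturbed = np.clip(original + rng.normal(scale=10, size=original.shape), 0, 255)

def rgb_distribution(frame, bins=32):
    """Normalized histogram of all RGB values in a frame."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 255))
    return (hist + 1e-9) / (hist.sum() + bins * 1e-9)             # smoothing avoids zero bins

kl = entropy(rgb_distribution(original), rgb_distribution(perturbed))   # KL divergence
mse = np.mean((original.astype(float) - perturbed) ** 2)
print(f"KL divergence = {kl:.4f}, MSE = {mse:.2f}")
```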
The utility loss in the Object Indistinguishability scheme is specified through comparisons of distinct object counts and object trajectories between the original video and its randomized version. The trajectory deviation is measured between 0.1 and 0.2, while the count difference is on the order of tens. Visual perception dissimilarities are also illustrated with original and synthetic frames, where synthetic frames are composed of randomized occurring objects and interpolated background information. Overall, consistent with the fact that larger-scale datasets render less utility loss in aggregated statistics under LDP [20], videos with more objects sampled suffer less utility loss. Since this scheme deals with the object presence/absence event rather than object features, deep learning attacks are not examined here.

4.4 Utility and Privacy Analysis in DP-protected Text


Word embedding models such as word2vec, GloVe, and FastText are used to represent discrete words as real-numbered vectors in a continuous space, where the semantic similarity between words is reflected by their vector distance [67]. More structural relationships between words have been exploited to learn more effective word vectors. For instance, Poincaré embeddings [72] employ the hierarchical semantic relationships between words to obtain generalized word vectors and their subsets. With the adoption of Poincaré embeddings in Hyperbolic Privacy [35], the utility loss in text obfuscation is reduced. Improving the quality of word vectors in the text representation learning area therefore remains a challenge for building more utility-enhancing text privacy models. In Metric DP-based text privacy models [31, 34–36, 114–116, 120], real-valued vectors are perturbed in the continuous space and then transformed back to words in the discrete space. The perturbation trades some utility loss for privacy gains, while the discretization process inevitably loses some information.
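The common pipeline behind these models can be sketched as follows: a word vector is perturbed with noise in the continuous space and then discretized back to the nearest vocabulary word. The tiny vocabulary, embedding values, and Laplace-style noise below are placeholders; the actual noise distributions in [31, 34] are calibrated to their respective metrics.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["dove", "soap", "shampoo", "chocolate"]
vocab_vecs = rng.normal(size=(len(vocab), 50))                      # placeholder word embeddings

def perturb_word(word, epsilon):
    """Perturb a word vector in continuous space, then snap to the nearest vocabulary word."""
    vec = vocab_vecs[vocab.index(word)]
    noisy = vec + rng.laplace(scale=1.0 / epsilon, size=vec.shape)  # placeholder noise
    dists = np.linalg.norm(vocab_vecs - noisy, axis=1)
    return vocab[int(np.argmin(dists))]                             # discretization step

print(perturb_word("dove", epsilon=0.5))
```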
Basic privacy methods [31, 109] achieve effective privacy and utility tradeoffs with DP and metric DP guarantees, respectively. The work [109] takes the F1-score metric to evaluate classification accuracy. It adopts the JStylo authorship attribution algorithm to model AI attacks and shows its robustness to this attack. The other [31] is the baseline metric DP method based on the Earth Mover's distance. It takes counts of correct predictions as its utility measurement. More precisely, the authors employ author identification and topic classification tasks, using 1-NN and the Koppel algorithm [55] for author identification, and 5-NN and fastText for topic classification.
Lines of work including [34, 35] investigate utility increases by exploiting the structural distance properties of text embeddings in metric space. The work [34] adopts the Euclidean distance to enhance utility. It uses binary classification, multi-class classification, and question-answering tasks to demonstrate the utility loss. To quantify the utility losses, it takes classification accuracy, mean average precision, and mean reciprocal rank. As for privacy evaluations, this work employs query scrambling methods [89] to evaluate the realistic privacy loss. The other work [35] uses hierarchical relationships in word embeddings to keep the hierarchical utility of perturbed words. The authors carry out machine learning and natural language tasks to evaluate prediction accuracy scores, comparing against a totally randomized baseline and non-private word embeddings including InferSent, SkipThought, and FastText. To confirm its privacy guarantees against the authorship attribution attack, they use the traditional Koppel algorithm to adapt to the obfuscation.
Within the Euclidean-distance-based privacy model, studies [114] and [120] target enhancing
text utility. The authors in [114] reduce the range of substitution candidates before noise injec-
tion. The robustness performance to adversarial attacks is shown in comparison with the baseline
Laplace perturbation. The authors in [120] specify text sensitivity levels and only perturb highly
sensitive ones. In particular, they use BERT embeddings, which are superior to conventional embeddings. They also measure model robustness to a mask token inference attack.
Researchers in [115] and [116] focus on privacy improvements for the case in which perturbed words would otherwise be mapped back to themselves. In [115], since the key contribution is the boosted privacy for words in sparse regions, the privacy guaranteed by the Mahalanobis mechanism is compared with that of the multivariate Laplace mechanisms in [31, 34]. More precisely, the probability that words in sparse regions are perturbed to other words rather than to themselves is calculated. As for utility, common classification metrics including accuracy, precision, and recall are adopted when comparing the multivariate Laplace, regularized Mahalanobis, and pure Mahalanobis mechanisms. In [116], the insight to improve text privacy is to discretize private embeddings to their top-k nearest words when the DP noise is small. Regarding utility measurement, inference error is compared with another privacy-enhanced method (i.e., the Mahalanobis method) and the baseline multivariate Laplace method.
The study [36] can be applied to any distance-metric-based privacy model for text. Its fundamental contribution is to reduce the high dimensionality of initial text embeddings for improved utility. Some utility loss is incurred by the random projections; however, it is outweighed by the utility gain from adding less DP noise. On one hand, an empirical study is conducted on embedding models, where utility errors are measured as the differences between the Euclidean distance (respectively, inner product) of real vector pairs and that of private vector pairs; utility is also evaluated with accuracy and AUC metrics in an SVM linear classification task. On the other hand, an empirical study is applied to the NLP database, and training and test accuracies make up the evaluation metrics.

5 FUTURE DIRECTIONS
5.1 Adaptive Differential Privacy Models for Complex Data
Differential privacy has been extensively studied for a variety of structured data, including numerical, categorical, and key-value data [78, 98, 119]. While it provides rigorous privacy guarantees for structured data, extending differential privacy models to a broader context, for instance, unstructured data or data that are complex and qualitative in nature, in order to ensure rigorous privacy guarantees for them, remains challenging.
Recently, besides unstructured data in this work, differential privacy has been applied to other
rich data. Several investigations studied formal privacy guarantees for individuals’ physiological
and psychological information reflected by eye gaze data [8, 58, 60, 92]. Specifically, Steil et al. [92]

studied a differential privacy framework over eye movement data for virtual reality (VR) users.
Liu et al. [60] customized a differential privacy model for gaze maps on eye tracking heatmaps.
Bozkir et al. [8] proposed a Discrete Fourier Transformation-based differential privacy model for
temporal related eye movement features. Another work [58] used the geo-indistinguishability con-
cept and w-event differential privacy scheme to protect time series gaze data points as the tem-
poral gaze trajectory. Of great significance is that the authors also developed a practical system
to integrate the proposed privacy protection technology with current eye-tracking ecosystems.
Another work [105] proposed a matrix-based differential privacy model for fingerprint identifica-
tion. These works focused on specific rich data and customized differential privacy definitions for
them according to their specific privacy requirements. It is challenging to further explore other
privacy scenarios that need rigorous protection, such as complex scenarios in IoT systems [9, 17]
and smart grid systems [45]. It is demanding for researchers to specify what to protect in those
contexts, explore the special structure of the underlying data domain, and then adjust DP goals
with the incorporation of context-specific restrictions.

5.2 Privacy Mechanisms Design


Randomized mechanisms in different privacy scenarios are mainly based on standard DP mecha-
nisms, including the Laplace mechanism, Exponential mechanism, and randomized response. For
complex data types such as unstructured data, the first step is to decide on the sensitive data. The
next step is to represent unstructured data with a structured format such as real-valued or bit-
valued vectors. Mechanism design for unstructured data is thus converted into designing appropriate mechanisms for structured data.
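A minimal sketch of this conversion is given below: once an unstructured item has been represented as a bounded real-valued vector, a standard Laplace mechanism with sensitivity derived from the vector bounds can be applied. The bounds and per-coordinate sensitivity used here are illustrative assumptions, not values prescribed by any specific scheme.

```python
import numpy as np

def laplace_mechanism(vector, sensitivity, epsilon, rng):
    """Standard Laplace mechanism applied elementwise to a real-valued feature vector."""
    scale = sensitivity / epsilon
    return vector + rng.laplace(scale=scale, size=vector.shape)

rng = np.random.default_rng(0)
features = np.clip(rng.normal(size=16), -1.0, 1.0)   # feature vector assumed bounded in [-1, 1]
sensitivity = 2.0                                     # per-coordinate range under the bound assumption
private_features = laplace_mechanism(features, sensitivity, epsilon=1.0, rng=rng)
```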
Currently, a rich set of work has proposed a variety of utility-enhanced mechanisms for struc-
tured data. Balle et al. [5] proposed shuffled DP mechanisms to improve data utility within small-
scale dataset scenarios. Several works [41, 69] studied utility-optimized randomized response
mechanisms by considering different privacy requirements from data with different sensitivity
levels. Holohan et al. [48] proposed improved Laplace mechanisms to achieve DP in more rigor-
ous ways. Liu [61] introduced generalized Laplace mechanisms to improve statistical utility. It is
promising to apply these improved mechanisms for unstructured data in order to further improve
data utility.
Another line of work has paid attention to reducing the high utility loss caused by the curse of dimensionality in DP-preserving high-dimensional structured data. The widely used dimensionality reduction method of the Johnson-Lindenstrauss transform has been proved to preserve DP in [6, 54] and was recently further developed to maximize utility under DP constraints [91]. A previous work [80] adopted the Discrete Fourier Transform to reduce utility error in long query sequences on time series data, and a similar method was applied to high-dimensional eye movement features [8]. Since the key ingredients of unstructured data are typically represented in high-dimensional spaces, applying these techniques after vector representation of unstructured data may yield extra utility.

5.3 Vectorization Representations


Vectorization representations of unstructured data are a key step to extract structured data from
unstructured data. Many works have studied vector representations in order to retain more information from unstructured data; such representations are also meaningful for improving data utility after perturbation. Wang et al. [105] proposed to use truncated SVD methods rather than standard SVD
to represent principal features of images. Chen et al. [13] adopted GANs to represent facial se-
mantic attributes and synthesize realistic facial images. Ganea et al. [39] proposed hierarchical
relationship-based vector representation methods to capture more semantic meanings between

words. Snyder et al. [88] proposed x-vector embeddings for variable-length speech segments to
capture utterance-level representations. Since the work [35] has proven that Hyperbolic-model-
based representations could improve text data utility and privacy guarantees, it is promising to
improve privacy-utility tradeoffs by using data representation methods that capture more information when considering unstructured data privacy.

5.4 Metric Differential Privacy


Metric DP has extended Hamming-distance-based DP to arbitrary metric-space-based privacy no-
tions [11]. It has shown great potential in providing privacy guarantees for a wider range of
data scenarios besides unstructured data in this work. Many works focused on its applications in
location-based services [3, 66, 95]. Several works studied its applications in general linear and range
queries [52, 112]. The work [32] applied Metric DP to similarity matching and nearest searching
scenarios [75]. Fan and Bonomi [29] used Metric DP to sanitize time series data. Gursoy et al. [42]
studied Metric DP for ordinal and non-ordinal items. Schein et al. [84] examined Metric DP for
sparse count-valued documents. It is a challenge to explore the potential of Metric DP in more
scenarios. For example, it is interesting to further explore the combination of Metric DP and other
popular probabilistic encoding techniques such as locality-sensitive hashing [121] and Bloom fil-
ter [117] to provide privacy guarantees for data matching scenarios.
Metric DP, in which the indistinguishability levels of private inputs vary with the distance between them, can be coupled with context-aware DP, in which the indistinguishability levels of private inputs change with their own sensitivity levels. In [1], Acharya et al. proposed a
context-aware LDP variant to allow different pairwise inputs in the data domain to have distinct
privacy levels rather than equal ones in standard LDP. Similarly, Gu et al. [41] identified input
data pairs that are in demand for different privacy levels. For instance, elements such as HIV or
cancer require stricter privacy protection than those that are less sensitive such as toothache or
flu. Further, they proposed that the privacy budget is a function of privacy budget pairs and set an
improved LDP goal named input-discriminative LDP to provide more practical privacy protections
for a pair of inputs. On the other hand, authors in [56] showed that privacy levels are dependent
on the neighboring density around the data itself. Another study [57] showed that indistinguisha-
bility levels of input pairs vary across social networks on the basis of users' social proximity. Examining special sensitivity requirements in a variety of scenarios and designing improved DP goals and privacy mechanisms will bring new insights.

5.5 Differential Privacy against AI Attacks


Although DP guarantees that adversaries cannot obtain additional information with the inclusion
or exclusion of one element, it cannot prevent adversaries from gaining knowledge with publicly
released data itself. To this end, adversaries may manage to infer or identify sensitive informa-
tion from the released data, especially equipped with AI tools. Re-identification attacks refer to
tracing an individual record back to its human source [24]. Dwork et al. [24] once studied the
robustness of DP in thwarting re-identification attacks. While other privacy-protecting methods
such as blurring for image data are demonstrated to be vulnerable to deep learning attacks [47],
it is necessary to theoretically or practically study the privacy guarantees of DP in mitigating
deep-learning-based re-identification attacks. Inference attack [43, 87] is another effective attack
that may pose significant privacy risks to DP-protected data. It has been identified in [43] that DP-
protected unstructured data may be susceptible to inference attacks. Future research is encouraged
to empirically study the robustness of privacy methods against inference attacks. It is also of great
importance to integrate some other methods suggested in [43] to boost privacy in designing DP
privacy methods.

6 CONCLUSIONS
Privacy concerns arise with the massive amounts of real-world unstructured data, including images, audio, videos, and texts. Differential privacy has emerged as a de facto standard to protect a wide range
of data types. This article has presented differential privacy methods for sensitive content in these
unstructured data. We have identified that unstructured data are obfuscated based on appropriate
vector representations, specific privacy models, and vector reconstructions. We have discussed
possible challenges faced with these privacy methods. We have provided an overview of privacy
guarantees and utility losses in these methods and possible approaches to improve data utility.
Differential privacy still has much unexplored potential in providing provable privacy guar-
antees for a broader range of data scenarios. This survey intends to present a summary of its
effective protection for unstructured data content and point out several opportunities in future
developments around this topic. It is meaningful to further explore the promising robustness of
DP to handle other privacy issues.

REFERENCES
[1] Jayadev Acharya, Kallista Bonawitz, Peter Kairouz, Daniel Ramage, and Ziteng Sun. 2020. Context aware local dif-
ferential privacy. In 37th International Conference on Machine Learning (ICML’20). 52–62.
[2] Balamurugan Anandan, Chris Clifton, Wei Jiang, Mummoorthy Murugesan, Pedro Pastrana-Camacho, and Luo Si.
2012. t-Plausibility: Generalizing words to desensitize text. Transactions on Data Privacy 5, 3 (2012), 505–534.
[3] Miguel E. Andrés, Nicolás E. Bordenabe, Konstantinos Chatzikokolakis, and Catuscia Palamidessi. 2013. Geo-
indistinguishability: Differential privacy for location-based systems. In Proc. of the 20th ACM SIGSAC Conference
on Computer and Communications Security (CCS’13). 901–914.
[4] Brendan Avent, Aleksandra Korolova, David Zeber, Torgeir Hovden, and Benjamin Livshits. 2017. BLENDER: En-
abling local search with a hybrid differential privacy model. In 26th USENIX Security Symposium. 747–764.
[5] Borja Balle, James Bell, Adria Gascon, and Kobbi Nissim. 2020. Private summation in the multi-message shuffle model.
In Proc. of the 27th ACM SIGSAC Conference on Computer and Communications Security (CCS’20). 657–676.
[6] Jeremiah Blocki, Avrim Blum, Anupam Datta, and Or Sheffet. 2012. The Johnson-Lindenstrauss transform itself
preserves differential privacy. In 53rd IEEE Annual Symposium on Foundations of Computer Science (FOCS’12).
410–419.
[7] Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword
information. Transactions of the Association for Computational Linguistics 5 (2017), 135–146.
[8] Efe Bozkir, Onur Gunlu, Wolfgang Fuhl, Rafael F. Schaefer, and Enkelejda Kasneci. 2020. Differential Privacy for eye
tracking with temporal correlations. Cryptology ePrint Archive, Report 2020/340. https://eprint.iacr.org/2020/340.
[9] Xingjuan Cai, Shaojin Geng, Di Wu, Jianghui Cai, and Jinjun Chen. 2021. A multi-cloud model based many-objective
intelligent algorithm for efficient task scheduling in Internet of Things. IEEE Internet of Things Journal 8, 12 (2021),
9645–9653. DOI:10.1109/JIOT.2020.3040019
[10] Frank Cangialosi, Neil Agarwal, Venkat Arun, Junchen Jiang, Srinivas Narayana, Anand Sarwate, and Ravi Netravali.
2022. Privid: Practical, privacy-preserving video analytics queries. In 19th USENIX Symposium on Networked Systems
Design and Implementation (NSDI). 209–228.
[11] Konstantinos Chatzikokolakis, Miguel E. Andrés, Nicolás Emilio Bordenabe, and Catuscia Palamidessi. 2013. Broaden-
ing the scope of differential privacy using metrics. In The 13th Privacy Enhancing Technologies Symposium (PETS’13).
82–102.
[12] Konstantinos Chatzikokolakis, Catuscia Palamidessi, and Marco Stronati. 2015. Constructing elastic distinguishabil-
ity metrics for location privacy. Proceedings on Privacy Enhancing Technologies. 2 (2015), 156–170.
[13] Jia-Wei Chen, Li-Ju Chen, Chia-Mu Yu, and Chun-Shien Lu. 2021. Perceptual Indistinguishability-Net (PI-Net): Facial
image obfuscation with manipulable semantics. In Proc. of the 34th IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR’21). 6478–6487.
[14] Raj Chetty and John N. Friedman. 2019. A practical method to reduce privacy loss when disclosing statistics based
on small samples. AEA Papers and Proceedings 109 (2019), 414–20.
[15] Christoph Ruegg, Marcus Cuda, and Jurgen Van Gael. 2009. Distance metrics. [EB/OL]. https://numerics.mathdotnet.com/Distance.html.
[16] Graham Cormode, Somesh Jha, Tejas Kulkarni, Ninghui Li, Divesh Srivastava, and Tianhao Wang. 2018. Privacy
at scale: Local differential privacy in practice. In Proc. of the 44th International Conference on Management of Data
(SIGMOD’18). 1655–1658.

[17] Zhihua Cui, Fei Xue, Shiqiang Zhang, Xingjuan Cai, Yang Cao, Wensheng Zhang, and Jinjun Chen. 2020. A hy-
brid blockchain-based identity authentication scheme for multi-WSN. IEEE Transactions on Services Computing 13, 2
(2020), 241–251.
[18] Fida Kamal Dankar and Khaled El Emam. 2013. Practicing differential privacy in health care: A review. Transactions
on Data Privacy 6, 1 (2013), 35–67.
[19] Damien Desfontaines and Balázs Pejó. 2020. SoK: Differential privacies. Proceedings on Privacy Enhancing Technolo-
gies 2020, 2 (2020), 288–313.
[20] John C. Duchi, Michael I. Jordan, and Martin J. Wainwright. 2013. Local privacy and statistical minimax rates. In 54th
IEEE Annual Symposium on Foundations of Computer Science (FOCS’13). 429–438.
[21] Cynthia Dwork. 2008. Differential privacy: A survey of results. Theory and Applications of Models of Computation
4978 (2008), 1–19.
[22] Cynthia Dwork. 2009. The differential privacy frontier. In The 6th Theory of Cryptography Conference. 496–502.
[23] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. 2006. Calibrating noise to sensitivity in private
data analysis. In The 3th Theory of Cryptography Conference (TCC’06). 265–284.
[24] Cynthia Dwork, Adam Smith, Thomas Steinke, and Jonathan Ullman. 2017. Exposed! A survey of attacks on private
data. Annual Review of Statistics and Its Application 4, 1 (2017), 61–84.
[25] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. 2014. RAPPOR: Randomized aggregatable privacy-
preserving ordinal response. In Proc. of the 21st ACM SIGSAC Conference on Computer and Communications Security
(CCS’14). 1054–1067.
[26] Liyue Fan. 2018. Image pixelization with differential privacy. In 32nd Annual IFIP WG 11.3 Conference on Data and
Applications Security and Privacy (DBSec’18). 148–162.
[27] Liyue Fan. 2019. Practical image obfuscation with provable privacy. In 20th IEEE International Conference on Multi-
media and Expo (ICME’19). 784–789.
[28] Liyue Fan. 2020. A survey of differentially private generative adversarial networks. In The AAAI Workshop on Privacy-
Preserving Artificial Intelligence.
[29] Liyue Fan and Luca Bonomi. 2018. Time series sanitization with metric-based privacy. In 6th IEEE International
Congress on Big Data. 264–267.
[30] Fuming Fang, Xin Wang, Junichi Yamagishi, Isao Echizen, Massimiliano Todisco, Nicholas Evans, and Jean-Francois
Bonastre. 2019. Speaker anonymization using x-vector and neural waveform models. In The 10th ISCA Speech Syn-
thesis Workshop. 155–160.
[31] Natasha Fernandes, Mark Dras, and Annabelle McIver. 2019. Generalised differential privacy for text document pro-
cessing. In The 8th International Conference on Principles of Security and Trust (POST’19). 123–148.
[32] Natasha Fernandes, Yusuke Kawamoto, and Takao Murakami. 2020. Locality sensitive hashing with extended differ-
ential privacy. arXiv preprint, arXiv:2010.09393 (2020).
[33] Oluwaseyi Feyisetan, Abhinav Aggarwal, Zekun Xu, and Nathanael Teissier. 2020. Research challenges in designing
differentially private text generation mechanisms. arXiv preprint, arXiv:2012.05403 (2020).
[34] Oluwaseyi Feyisetan, Borja Balle, Thomas Drake, and Tom Diethe. 2020. Privacy-and utility-preserving textual anal-
ysis via calibrated multivariate perturbations. In The 13th ACM International Conference on Web Search and Data
Mining (WSDM’20). 178–186.
[35] Oluwaseyi Feyisetan, Tom Diethe, and Thomas Drake. 2019. Leveraging hierarchical representations for preserving
privacy and utility in text. In 19th IEEE International Conference on Data Mining (ICDM’19). 210–219.
[36] Oluwaseyi Feyisetan and Shiva Kasiviswanathan. 2021. Private release of text embedding vectors. In Proc. of the 1st
Workshop on Trustworthy Natural Language Processing. 15–27.
[37] Sam Fletcher and Md Zahidul Islam. 2019. Decision tree classification with differential privacy: A survey. ACM
Computing Surveys 52, 4 (2019), 1–33.
[38] Andrea Frome, German Cheung, Ahmad Abdulkader, Marco Zennaro, Bo Wu, Alessandro Bissacco, Hartwig Adam,
Hartmut Neven, and Luc Vincent. 2009. Large-scale privacy protection in Google street view. In 12th IEEE Interna-
tional Conference on Computer Vision (ICCV’09). 2373–2380.
[39] Octavian Ganea, Gary Becigneul, and Thomas Hofmann. 2018. Hyperbolic entailment cones for learning hierarchical
embeddings. In The 35th International Conference on Machine Learning (ICML’18). 1646–1655.
[40] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classifi-
cation: Labelling unsegmented sequence data with recurrent neural networks. In The 23rd International Conference
on Machine Learning (ICML’06). 369–376.
[41] Xiaolan Gu, Ming Li, Li Xiong, and Yang Cao. 2020. Providing input-discriminative protection for local differential
privacy. In 36th IEEE International Conference on Data Engineering (ICDE’20). 505–516.
[42] Mehmet Emre Gursoy, Acar Tamersoy, Stacey Truex, Wenqi Wei, and Ling Liu. 2021. Secure and utility-aware data
collection with condensed local differential privacy. IEEE Transactions on Dependable and Secure Computing 18, 5
(2021), 2365–2378. DOI:10.1109/TDSC.2019.2949041

[43] Jihun Hamm. 2017. Minimax filter: Learning to preserve privacy from inference attacks. Journal of Machine Learning
Research 18, 1 (2017), 4704–4734.
[44] Yaowei Han, Sheng Li, Yang Cao, Qiang Ma, and Masatoshi Yoshikawa. 2020. Voice-indistinguishability: Protecting
voiceprint in privacy-preserving speech data release. In 21st IEEE International Conference on Multimedia and Expo
(ICME’20). 1–6.
[45] Muneeb Ul Hassan, Mubashir Husain Rehmani, and Jinjun Chen. 2020. DEAL: Differentially private auction for
blockchain-based microgrids energy trading. IEEE Transactions on Services Computing 13, 2 (2020), 263–275.
[46] Muneeb Ul Hassan, Mubashir Husain Rehmani, and Jinjun Chen. 2020. Differential privacy techniques for cyber
physical systems: A survey. IEEE Communications Surveys & Tutorials 22, 1 (2020), 746–789.
[47] Steven Hill, Zhimin Zhou, Lawrence Saul, and Hovav Shacham. 2016. On the (in) effectiveness of mosaicing and
blurring as tools for document redaction. Proceedings on Privacy Enhancing Technologies 2016, 4 (2016), 403–417.
[48] Naoise Holohan, Spiros Antonatos, Stefano Braghin, and Pól Mac Aonghusa. 2020. The bounded Laplace mechanism
in differential privacy. Journal of Privacy and Confidentiality 10, 1 (2020), 1–15. DOI:https://doi.org/10.29012/jpc.715
[49] Honglu Jiang, Jian Pei, Dongxiao Yu, Jiguo Yu, Bei Gong, and Xiuzhen Cheng. 2021. Applications of differen-
tial privacy in social network analysis: A survey. IEEE Transactions on Knowledge & Data Engineering (TKDE’21).
DOI:10.1109/TKDE.2021.3073062
[50] Woohwan Jung, Suyong Kwon, and Kyuseok Shim. 2021. TIDY: Publishing a time interval dataset with differential
privacy. IEEE Transactions on Knowledge and Data Engineering (TKDE) 33, 5 (2021), 2280–2294.
[51] Tadej Justin, Vitomir Štruc, Simon Dobrišek, Boštjan Vesnicer, Ivo Ipšić, and France Mihelič. 2015. Speaker de-
identification using diphone recognition and speech synthesis. In The 11th IEEE International Conference and Work-
shops on Automatic Face and Gesture Recognition (FG’15). 1–7.
[52] Parameswaran Kamalaruban, Victor Perrier, Hassan Jameel Asghar, and Mohamed Ali Kaafar. 2020. Not all attributes
are created equal: d X -private mechanisms for linear queries. Proceedings on Privacy Enhancing Technologies 2020, 1
(2020), 103–125.
[53] Shiva Prasad Kasiviswanathan, Homin K. Lee, Kobbi Nissim, Sofya Raskhodnikova, and Adam Smith. 2011. What
can we learn privately? SIAM Journal on Computing 40, 3 (2011), 793–826.
[54] Krishnaram Kenthapadi, Aleksandra Korolova, Ilya Mironov, and Nina Mishra. 2013. Privacy via the Johnson-
Lindenstrauss transform. Journal of Privacy and Confidentiality 5, 1 (2013), 39–71.
[55] Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2011. Authorship attribution in the wild. Language Resources
and Evaluation 45, 1 (2011), 83–94.
[56] Fragkiskos Koufogiannis and George J. Pappas. 2016. Location-dependent privacy. In 55th IEEE Conference on Decision
and Control (CDC’16). 7586–7591.
[57] Fragkiskos Koufogiannis and George J. Pappas. 2016. Multi-owner multi-user privacy. In 55th IEEE Conference on
Decision and Control (CDC’16). 1787–1793.
[58] Jingjie Li, Amrita Roy Chowdhury, Kassem Fawaz, and Younghyun Kim. 2021. Kalεido: Real-time privacy control for
eye-tracking systems. In The 30th USENIX Security Symposium.
[59] Tao Li and Chris Clifton. 2021. Differentially private imaging via latent space manipulation. arXiv preprint arXiv:
2103.05472 (2021).
[60] Ao Liu, Lirong Xia, Andrew Duchowski, Reynold Bailey, Kenneth Holmqvist, and Eakta Jain. 2019. Differential pri-
vacy for eye-tracking data. In The 11th ACM Symposium on Eye Tracking Research & Applications (ETRA’19). 28:1–
28:10.
[61] Fang Liu. 2018. Generalized Gaussian mechanism for differential privacy. IEEE Transactions on Knowledge and Data
Engineering (TKDE) 31, 4 (2018), 747–756.
[62] Naurang S. Mangat. 1994. An improved randomized response strategy. Journal of the Royal Statistical Society: Series
B (Methodological) 56, 1 (1994), 93–95.
[63] Richard McPherson, Reza Shokri, and Vitaly Shmatikov. 2016. Defeating image obfuscation with deep learning. arXiv
preprint, arXiv:1609.00408 (2016).
[64] Frank McSherry and Kunal Talwar. 2007. Mechanism design via differential privacy. In The 48th Annual IEEE Sym-
posium on Foundations of Computer Science (FOCS’07). 94–103.
[65] Frank D. McSherry. 2009. Privacy integrated queries: An extensible platform for privacy-preserving data analysis. In
Proc. of the 35th ACM SIGMOD International Conference on Management of Data (SIGMOD’09). 19–30.
[66] Ricardo Mendes, Mariana Cunha, and João P. Vilela. 2020. Impact of frequency of location reports on the privacy
level of geo-indistinguishability. Proceedings on Privacy Enhancing Technologies 2020, 2 (2020), 379–396.
[67] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in
vector space. arXiv preprint, arXiv:1301.3781 (2013).
[68] Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of
words and phrases and their compositionality. In Proc. of the 27th Annual Conference on Neural Information Processing
Systems (NeurIPS’13). 3111–3119.

[69] Takao Murakami and Yusuke Kawamoto. 2019. Utility-optimized local differential privacy mechanisms for distribu-
tion estimation. In The 28th USENIX Security Symposium. 1877–1894.
[70] A. Nautsch, C. Jasserand, Els Kindt, M. Todisco, I. Trancoso, and N. Evans. 2019. The GDPR & speech data: Reflec-
tions of legal and technology communities, first steps towards a common understanding. In Proc. of the 20th Annual
Conference of the International Speech Communication Association (INTERSPEECH’19). 3695–3699.
[71] Boel Nelson and Jenni Reuben. 2019. Chasing accuracy and privacy, and catching both: A literature survey on differ-
entially private histogram publication. arXiv preprint, arXiv:1910.14028 (2019).
[72] Maximillian Nickel and Douwe Kiela. 2017. Poincaré embeddings for learning hierarchical representations. In The
31st Annual Conference on Neural Information Processing Systems (NeurIPS’17). 6338–6347.
[73] Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representa-
tion. In The 19th Conference on Empirical Methods in Natural Language Processing (EMNLP’14). 1532–1543.
[74] Rishabh Poddar, Ganesh Ananthanarayanan, Srinath Setty, Stavros Volos, and Raluca Ada Popa. 2020. Visor: Privacy-
preserving video analytics as a cloud service. In The 29th USENIX Security Symposium. 1039–1056.
[75] Lianyong Qi, Xuyun Zhang, Wanchun Dou, Chunhua Hu, Chi Yang, and Jinjun Chen. 2018. A two-stage locality-
sensitive hashing based approach for privacy-preserving mobile service recommendation in cross-platform edge
environment. Future Generation Computer Systems 88 (2018), 636–643.
[76] Jianwei Qian, Haohua Du, Jiahui Hou, Linlin Chen, Taeho Jung, and Xiangyang Li. 2019. Speech sanitizer: Speech
content desensitization and voice anonymization. IEEE Transactions on Dependable and Secure Computing 18, 6 (2019),
2631–2642. DOI:10.1109/TDSC.2019.2960239
[77] Jianwei Qian, Haohua Du, Jiahui Hou, Linlin Chen, Taeho Jung, and Xiang-Yang Li. 2018. Hidebehind: Enjoy voice
input with voiceprint unclonability and anonymity. In The 16th ACM Conference on Embedded Networked Sensor
Systems (SenSys’18). 82–94.
[78] Youyang Qu, Shui Yu, Wanlei Zhou, Shiping Chen, and Jun Wu. 2021. Customizable reliable privacy-preserving data
sharing in cyber-physical social network. IEEE Transactions on Network Science and Engineering 8, 1 (2021), 269–281.
[79] Youyang Qu, Shui Yu, Wanlei Zhou, and Yonghong Tian. 2020. GAN-driven personalized spatial-temporal private
data sharing in cyber-physical social systems. IEEE Transactions on Network Science and Engineering 7, 4 (2020),
2576–2586.
[80] Vibhor Rastogi and Suman Nath. 2010. Differentially private aggregation of distributed time-series with transforma-
tion and encryption. In Proc. of the 36th ACM SIGMOD International Conference on Management of Data (SIGMOD’10).
735–746.
[81] Mercedes Rodriguez-Garcia, Montserrat Batet, and David Sánchez. 2015. Semantic noise: Privacy-protection of nom-
inal microdata through uncorrelated noise addition. In The 27th IEEE International Conference on Tools with Artificial
Intelligence (ICTAI’15). 1106–1113.
[82] Mukesh Saini, Pradeep K. Atrey, Sharad Mehrotra, and Mohan Kankanhalli. 2014. W3-privacy: Understanding what,
when, and where inference channels in multi-camera surveillance video. Multimedia Tools and Applications 68, 1
(2014), 135–158.
[83] David Sánchez and Montserrat Batet. 2016. C-sanitized: A privacy model for document redaction and sanitization.
Journal of the Association for Information Science and Technology 67, 1 (2016), 148–163.
[84] Aaron Schein, Zhiwei Steven Wu, Alexandra Schofield, Mingyuan Zhou, and Hanna Wallach. 2019. Locally private
Bayesian inference for count models. In The 36th International Conference on Machine Learning. 5638–5648.
[85] Andrew Senior, Sharath Pankanti, Arun Hampapur, Lisa Brown, Ying-Li Tian, Ahmet Ekin, Jonathan Connell,
Chiao Fe Shu, and Max Lu. 2005. Enabling video privacy through computer vision. IEEE Security & Privacy 3, 3
(2005), 50–57.
[86] Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proc. of the 22nd ACM SIGSAC Confer-
ence on Computer and Communications Security (CCS’15). 1310–1321.
[87] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against
machine learning models. In 38th IEEE Symposium on Security and Privacy (S&P’17). 3–18.
[88] David Snyder, Daniel Garcia-Romero, Gregory Sell, Daniel Povey, and Sanjeev Khudanpur. 2018. x-vectors: Robust
DNN embeddings for speaker recognition. In 43rd IEEE International Conference on Acoustics, Speech and Signal
Processing (ICASSP’18). 5329–5333.
[89] Congzheng Song and Vitaly Shmatikov. 2019. Auditing data provenance in text-generation models. In The 25th ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19). 196–206.
[90] Brij Mohan Lal Srivastava, Nathalie Vauquier, Md Sahidullah, Aurélien Bellet, Marc Tommasi, and Emmanuel Vincent.
2020. Evaluating voice conversion-based privacy protection against informed attackers. In The 45th IEEE International
Conference on Acoustics, Speech and Signal Processing (ICASSP’20). 2802–2806.
[91] Nina Mesing Stausholm. 2021. Improved differentially private Euclidean distance approximation. In Proc. of the 40th
ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems (PODS’21). 42–56.

[92] Julian Steil, Inken Hagestedt, Michael Xuelin Huang, and Andreas Bulling. 2019. Privacy-aware eye tracking using
differential privacy. In The 11th ACM Symposium on Eye Tracking Research & Applications (ETRA’19). 27:1–27:9.
[93] Mingxuan Sun, Qing Wang, and Zicheng Liu. 2020. Human action image generation with differential privacy. In The
21st IEEE International Conference on Multimedia and Expo (ICME’20). 1–6.
[94] Qianru Sun, Liqian Ma, Seong Joon Oh, Luc Van Gool, Bernt Schiele, and Mario Fritz. 2018. Natural and effective
obfuscation by head inpainting. In Proc. of the 21st IEEE Conference on Computer Vision and Pattern Recognition
(CVPR’18). 5050–5059.
[95] Shun Takagi, Yang Cao, Yasuhito Asano, and Masatoshi Yoshikawa. 2019. Geo-graph-indistinguishability: Protecting
location privacy for LBS over road networks. In 33rd IFIP Annual WG 11.3 Conference on Data and Applications
Security and Privacy (DBSec’19). 143–163.
[96] Manolis Terrovitis, Nikos Mamoulis, and Panos Kalnis. 2008. Privacy-preserving anonymization of set-valued data.
Proceedings of the VLDB Endowment 1, 1 (2008), 115–125.
[97] Florian Tramér, Zhicong Huang, Jean-Pierre Hubaux, and Erman Ayday. 2015. Differential privacy with bounded pri-
ors: Reconciling utility and privacy in genome-wide association studies. In Proc. of the 22nd ACM SIGSAC Conference
on Computer and Communications Security (CCS’15). 1286–1297.
[98] Michael Carl Tschantz, Shayak Sen, and Anupam Datta. 2020. SoK: Differential privacy as a causal property. In 41st
IEEE Symposium on Security and Privacy (S&P’20). 354–371.
[99] William Vickrey. 1961. Counterspeculation, auctions, and competitive sealed tenders. Journal of Finance 16, 1 (1961),
8–37.
[100] Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Jonas Borgstrom,
Leibny Paola García-Perera, Fred Richardson, Réda Dehak, et al. 2020. State-of-the-art speaker recognition with
neural network embeddings in NIST SRE18 and speakers in the wild evaluations. Computer Speech & Language 60
(2020), 101026. DOI:https://doi.org/10.1016/j.csl.2019.101026
[101] Sameer Wagh, Xi He, Ashwin Machanavajjhala, and Prateek Mittal. 2021. DP-cryptography: Marrying differential
privacy and cryptography in emerging applications. Communications of the ACM 64, 2 (2021), 84–93.
[102] Isabel Wagner and David Eckhoff. 2018. Technical privacy metrics: A systematic survey. ACM Computing Surveys 51,
3 (2018), 1–38.
[103] Han Wang, Yuan Hong, Yu Kong, and Jaideep Vaidya. 2020. Publishing video data with indistinguishable objects. In
Proc. of the 23rd International Conference on Extending Database Technology (EDBT’20). 323–334.
[104] Han Wang, Shangyu Xie, and Yuan Hong. 2020. VideoDP: A universal platform for video analytics with differential
privacy. Proceedings on Privacy Enhancing Technologies 2020, 4 (2020), 277–296.
[105] Tao Wang, Zhigao Zheng, Ali Kashif Bashir, Alireza Jolfaei, and Yanyan Xu. 2021. FinPrivacy: A privacy-preserving
mechanism for fingerprint identification. ACM Transactions on Internet Technology 21, 3 (2021). DOI:10.1145/3387130
[106] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. 2004. Image quality assessment: From error
visibility to structural similarity. IEEE Transactions on Image Processing 13, 4 (2004), 600–612.
[107] Zhibo Wang, Hengchang Guo, Zhifei Zhang, Mengkai Song, Siyan Zheng, Qian Wang, and Ben Niu. 2020. Towards
compression-resistant privacy-preserving photo sharing on social networks. In The 21st International Symposium on
Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing (Mobihoc’20). 81–90.
[108] Stanley L. Warner. 1965. Randomized response: A survey technique for eliminating evasive answer bias. Journal of
the American Statistical Association 60, 309 (1965), 63–69.
[109] Benjamin Weggenmann and Florian Kerschbaum. 2018. SynTF: Synthetic and differentially private term frequency
vectors for privacy-preserving text mining. In The 41st International ACM SIGIR Conference on Research & Develop-
ment in Information Retrieval (SIGIR’18). 305–314.
[110] Bingzhe Wu, Shiwan Zhao, Guangyu Sun, Xiaolu Zhang, Zhong Su, Caihong Zeng, and Zhihong Liu. 2019. P3SGD:
Patient privacy preserving SGD for regularizing deep CNNs in pathological image classification. In Proc. of the 27th
IEEE Conference on Computer Vision and Pattern Recognition (CVPR’19). 2099–2108.
[111] Zhenyu Wu, Zhangyang Wang, Zhaowen Wang, and Hailin Jin. 2018. Towards privacy-preserving visual recognition
via adversarial training: A pilot study. In The 15th European Conference on Computer Vision (ECCV’18). 606–624.
[112] Zhuolun Xiang, Bolin Ding, Xi He, and Jingren Zhou. 2020. Linear and range counting under metric-based local
differential privacy. In 17th IEEE International Symposium on Information Theory (ISIT’20). 908–913.
[113] Chuan Xu, Li Luo, Yingyi Ding, Guofeng Zhao, and Shui Yu. 2020. Personalized location privacy protection for
location-based services in vehicular networks. IEEE Wireless Communications Letters 9, 10 (2020), 1633–1637.
[114] Nan Xu, Oluwaseyi Feyisetan, Abhinav Aggarwal, Zekun Xu, and Nathanael Teissier. 2021. Density-aware dif-
ferentially private textual perturbations using truncated Gumbel noise. Proceedings of FLAIRS 34, 1 (2021), 1–8.
DOI:https://doi.org/10.32473/flairs.v34i1.128463
[115] Zekun Xu, Abhinav Aggarwal, Oluwaseyi Feyisetan, and Nathanael Teissier. 2020. A differentially private text per-
turbation method using a regularized Mahalanobis metric. In Proc. of the 2nd Workshop on PrivateNLP at the 25th
Conference on Empirical Methods in Natural Language Processing (EMNLP’20). 7–17.

[116] Zekun Xu, Abhinav Aggarwal, Oluwaseyi Feyisetan, and Nathanael Teissier. 2021. On a utilitarian approach to pri-
vacy preserving text generation. In Proc. of the 3rd Workshop on Privacy in Natural Language Processing. 11–20.
[117] Wanli Xue, Dinusha Vatsalan, Wen Hu, and Aruna Seneviratne. 2020. Sequence data matching and beyond: New
privacy-preserving primitives based on Bloom filters. IEEE Transactions on Information Forensics and Security (TIFS)
15 (2020), 2973–2987.
[118] Jimei Yang, Brian Price, Scott Cohen, Honglak Lee, and Ming-Hsuan Yang. 2016. Object contour detection with a
fully convolutional encoder-decoder network. In Proc. of the 29th IEEE Conference on Computer Vision and Pattern
Recognition (CVPR’16). 193–202.
[119] Qingqing Ye, Haibo Hu, Xiaofeng Meng, and Huadi Zheng. 2019. PrivKV: Key-value data collection with local differ-
ential privacy. In 40th IEEE Symposium on Security and Privacy (S&P’19). 317–331.
[120] Xiang Yue, Minxin Du, Tianhao Wang, Yaliang Li, Huan Sun, and Sherman S. M. Chow. 2021. Differential privacy for
text analytics via natural text sanitization. In The Joint Conference of the 59th Annual Meeting of the Association for
Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (ACL-IJCNLP).
[121] Xuyun Zhang, Christopher Leckie, Wanchun Dou, Jinjun Chen, Ramamohanarao Kotagiri, and Zoran Salcic. 2016.
Scalable local-recoding anonymization using locality sensitive hashing for big data privacy preservation. In The 25th
ACM International on Conference on Information and Knowledge Management (CIKM’16). 1793–1802.
[122] Tianqing Zhu, Gang Li, Wanlei Zhou, and S. Yu Philip. 2017. Differentially private data publishing and analysis: A
survey. IEEE Transactions on Knowledge and Data Engineering (TKDE) 29, 8 (2017), 1619–1638.

Received January 2021; revised July 2021; accepted September 2021
