CHAPTER 1
INTRODUCTION
1.1 THE PROCESS OF KDD
Data mining has attracted more and more attention in recent years, probably because of
the popularity of the ‘‘big data’’ concept. Data mining is the process of discovering
interesting patterns and knowledge from large amounts of data [1]. As a highly
application-driven discipline, data mining has been successfully applied to many
domains, such as business intelligence, Web search, scientific discovery, digital libraries,
etc.
The term ‘‘data mining’’ is often treated as a synonym for another term, ‘‘knowledge discovery from data’’ (KDD), which highlights the goal of the mining process. To obtain useful knowledge from data, the following steps are performed in an iterative way (see Fig. 1.1):
• Step 1: Data preprocessing. Basic operations include data selection (to retrieve data relevant to the KDD task from the database), data cleaning (to remove noise and inconsistent data, to handle missing data fields, etc.) and data integration (to combine data from multiple sources).
• Step 2: Data transformation. The goal is to transform data into forms appropriate for the mining task, that is, to find useful features to represent the data. Feature selection and feature transformation are basic operations.
• Step 3: Data mining. This is an essential process where intelligent methods are employed to extract data patterns (e.g. association rules, clusters, classification rules, etc.).
FIGURE 1.1 An overview of the KDD process.
• Step 4: Pattern evaluation and presentation. Basic operations include identifying the truly interesting patterns which represent knowledge, and presenting the mined knowledge in an easy-to-understand fashion.
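To make the four steps concrete, the following minimal Python sketch (with invented toy records and thresholds) walks a small transaction set through selection and cleaning, transformation into item sets, a crude frequent-pair search as the mining step, and a support-based evaluation of the resulting patterns.

```python
from collections import Counter
from itertools import combinations

# Toy transaction records: (customer_id, region, items bought) -- illustrative only
raw_records = [
    (1, "north", ["milk", "bread", None]),
    (2, "south", ["milk", "diapers", "beer"]),
    (3, "north", ["bread", "milk", "diapers"]),
    (4, "north", ["beer", "diapers"]),
]

# Step 1: data selection, cleaning and integration
selected = [r for r in raw_records if r[1] == "north"]          # selection
cleaned = [(cid, reg, [i for i in items if i is not None])      # cleaning: drop missing items
           for cid, reg, items in selected]

# Step 2: data transformation -- represent each record as an item set
baskets = [frozenset(items) for _, _, items in cleaned]

# Step 3: data mining -- count frequent item pairs (a crude association search)
pair_counts = Counter(p for b in baskets for p in combinations(sorted(b), 2))

# Step 4: pattern evaluation and presentation -- keep pairs above a support threshold
min_support = 2
for pair, count in pair_counts.items():
    if count >= min_support:
        print(f"frequent pair {pair} appears in {count} baskets")
```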
Although the information discovered by data mining can be very valuable to many applications, people have shown increasing concern about the other side of the coin, namely the privacy threats posed by data mining [2]. Individuals' privacy may be violated due to unauthorized access to personal data, the undesired discovery of embarrassing information, the use of personal data for purposes other than the one for which the data were collected, etc. For instance, the U.S. retailer Target once received complaints from a customer who was angry that Target sent coupons for baby clothes to his teenage daughter. However, the daughter was in fact pregnant at that time, and Target correctly inferred this by mining its customer data. From this story, we can see that the conflict between data mining and privacy protection does exist. To deal with the privacy issues in data mining, a subfield of data mining, referred to as privacy-preserving data mining (PPDM), has developed rapidly in recent years. The objective of PPDM is to safeguard sensitive information from unsolicited or unsanctioned disclosure, and meanwhile, preserve the utility of the data. The consideration of PPDM is twofold. First, sensitive raw data, such as an individual's ID card number and cell phone number, should not be directly used for mining. Second, sensitive mining results whose disclosure would result in privacy violation should be excluded. After the pioneering work of Agrawal et al. [3], [4], numerous studies on PPDM have been conducted.
Current models and algorithms proposed for PPDM mainly focus on how to hide sensitive information from certain mining operations. However, as depicted in Fig. 1.1, the whole KDD process involves multi-phase operations. Besides the mining phase, privacy issues may also arise in the data collection or data preprocessing phases, and even in the delivery of the mining results. In this paper, we investigate the privacy aspects of data mining by considering the whole knowledge-discovery process. We present an overview of the many approaches that can help to make proper use of sensitive data and protect the security of sensitive information discovered by data mining. We use the term ‘‘sensitive information’’ to refer to privileged or proprietary information that only certain people are allowed to see and that is therefore not accessible to everyone. If sensitive information is lost or used in any way other than intended, the result can be severe damage to the person or organization to which that information belongs. The term ‘‘sensitive data’’ refers to data from which sensitive information can be extracted. Throughout the paper, we consider the two terms ‘‘privacy’’ and ‘‘sensitive information’’ to be interchangeable. In a typical data mining scenario (see Fig. 1.2), four different user roles can be identified:
• Data Provider: the user who owns some data that are desired by the data mining task.
• Data Collector: the user who collects data from data providers and then publishes the data to the data miner.
• Data Miner: the user who performs data mining tasks on the data.
• Decision Maker: the user who makes decisions based on the data mining results in order
to achieve certain goals.
In the data mining scenario depicted in Fig. 1.2, a user represents either a person or an organization. Also, one user can play multiple roles at once. For example, in the Target story mentioned above, the customer plays the role of data provider, and the retailer plays the roles of data collector, data miner and decision maker. By differentiating the four user roles, we can explore the privacy issues in data mining in a principled way. All users care about the security of sensitive information, but each user role views the security issue from its own perspective. What we need to do is to identify the privacy problems that each user role is concerned about, and to find appropriate solutions to these problems. Here we briefly describe the privacy concerns of each user role; detailed discussions will be presented in the following sections.
1.3.1 DATA PROVIDER
The major concern of a data provider is whether he can control the sensitivity of the data he provides to others. On one hand, the provider should be able to make his very private data, namely the data containing information that he does not want anyone else to know, inaccessible to the data collector. On the other hand, if the provider has to provide some data to the data collector, he wants to hide his sensitive information as much as possible and get enough compensation for the possible loss in privacy.
1.3.2 DATA COLLECTOR
The data collected from data providers may contain individuals’ sensitive information.
Directly releasing the data to the data miner will violate data providers’ privacy, hence
data modification is required. On the other hand, the data should still be useful after
modification, otherwise collecting the data will be meaningless. Therefore, the major
concern of the data collector is to guarantee that the modified data contain no sensitive
information but still preserve high utility.
1.3.3 DATA MINER
The data miner applies mining algorithms to the data provided by the data collector, and he wishes to extract useful information from the data in a privacy-preserving manner. As introduced above, PPDM covers two types of protection, namely the protection of the sensitive data themselves and the protection of sensitive mining results. With the user-role-based methodology proposed in this paper, we consider that the data collector should take the major responsibility of protecting sensitive data, while the data miner can focus on how to hide the sensitive mining results from untrusted parties.
1.3.4 DECISION MAKER
As shown in Fig. 1.2, a decision maker can get the data mining results directly from the data miner, or from some information transmitter. It is possible that the information transmitter changes the mining results, intentionally or unintentionally, which may cause serious loss to the decision maker. Therefore, the main concern of the decision maker is whether the mining results are credible [4].
Using methodologies from game theory [8], we can derive useful implications on how each user role should behave in an attempt to solve his privacy problems.
The remainder of this paper is organized as follows. We discuss the privacy problems of, and the approaches to these problems for, the data provider, data collector, data miner and decision maker, respectively. Studies of game-theoretic approaches in the context of privacy-preserving data mining are then reviewed, followed by some non-technical issues related to sensitive information protection.
CHAPTER 2
DATA PROVIDER
2.1 CONCERNS OF DATA PROVIDER
A data provider owns some data from which valuable information can be extracted. In the data mining scenario depicted in Fig. 1.2, there are actually two types of data providers: one refers to the data provider who provides data to the data collector, and the other refers to the data collector who provides data to the data miner. To differentiate the privacy-protecting methods adopted by different user roles, in this section we restrict ourselves to the ordinary data provider, the one who owns a relatively small amount of data which contain information only about himself. Data reporting information about an individual are often referred to as ‘‘microdata’’ [5]. If a data provider reveals his microdata to the data collector, his privacy might be compromised due to an unexpected data breach or exposure of sensitive information. Hence, the privacy concern of a data provider is whether he can take control over what kind of and how much information other people can obtain from his data. To investigate the measures that the data provider can adopt to protect privacy, we consider the following three situations:
1) If the data provider considers his data to be very sensitive, that is, the data may reveal some information that he does not want anyone else to know, the provider can simply refuse to provide such data. Effective access-control measures are desired by the data provider, so that he can prevent his sensitive data from being stolen by the data collector.
2) Realizing that his data are valuable to the data collector (as well as the data miner), the data provider may be willing to hand over some of his private data in exchange for certain benefits, such as better services or monetary rewards. The data provider needs to know how to negotiate with the data collector, so that he will get enough compensation for any possible loss in privacy.
3) If the data provider can neither prevent the access to his sensitive data nor make a lucrative deal with the data collector, the data provider can distort the data that will be fetched by the data collector, so that his true information cannot be easily disclosed.
2.2 APPROACHES TO PRIVACY PROTECTION
A data provider provides his data to the collector in either an active way or a passive way. By ‘‘active’’ we mean that the data provider voluntarily opts into a survey initiated by the data collector, or fills in registration forms to create an account on a website. By ‘‘passive’’ we mean that the data, which are generated by the provider's routine activities, are recorded by the data collector, while the data provider may not even be aware of the disclosure of his data. When the data provider provides his data actively, he can simply ignore the collector's demand for the information that he deems very sensitive. If his data are passively provided to the data collector, the data provider can take some measures to limit the collector's access to his sensitive data.
Suppose that the data provider is an Internet user who is afraid that his online activities may expose his privacy. To protect privacy, the user can try to erase the traces of his online activities by emptying the browser's cache, deleting cookies, clearing usage records of applications, etc. Also, the provider can utilize various security tools that are developed for the Internet environment to protect his data. Many of these security tools are designed as browser extensions for ease of use. Based on their basic functions, current security tools can be categorized into the following three types:
1) Anti-tracking extensions. The Do Not Track (DNT) mechanism allows users to signal that they do not want their online activities to be tracked, and many web browsers have added support for DNT. DNT is not only a technology but also a policy framework for how companies that receive the signal should respond. The W3C Tracking Protection Working Group [10] is now trying to standardize how websites should respond to users' DNT requests.
2) Advertisement and script blockers. This type of browser extension can block advertisements on websites, and kill scripts and widgets that send the user's data to unknown third parties. Example tools include Adblock Plus, NoScript, FlashBlock, etc.
In addition to the tools mentioned above, an Internet user should always use anti-virus and anti-malware tools to protect the data stored on digital equipment such as personal computers, cell phones and tablets. With the help of all these security tools, the data provider can limit others' access to his personal data. Though there is no guarantee that one's sensitive data can be completely kept out of the reach of untrustworthy data collectors, making a habit of clearing online traces and using security tools does help to reduce the risk of privacy disclosure.
In some cases, the data provider needs to make a trade-off between the loss of privacy and the benefits brought by participating in data mining. For example, by analyzing a user's demographic information and browsing history, a shopping website can offer personalized product recommendations to the user. The user's sensitive preferences may be disclosed, but he can enjoy a better shopping experience. Driven by such benefits, e.g. a personalized service or monetary incentives, the data provider may be willing to provide his sensitive data to a trustworthy data collector, who promises that the provider's sensitive information will not be revealed to an unauthorized third party. If the provider
is able to predict how much benefit he can get, he can rationally decide what kind of and how much sensitive data to provide. For example, suppose a data collector asks the data provider to provide information about his age, gender, occupation and annual salary, and tells the data provider how much he would pay for each data item. If the data provider considers salary to be his sensitive information, then based on the prices offered by the collector, he chooses one of the following actions: i) to refuse to report his salary, if he thinks the price is too low;
ii) to report a fuzzy value of his salary, e.g. ‘‘less than 10,000 dollars’’, if he thinks the price is just acceptable; iii) to report an accurate value of his salary, if he thinks the price is high enough. From this example we can see that both the privacy preference of the data provider and the incentives offered by the data collector will affect the data provider's decision about his sensitive data. On the other hand, the data collector can make a profit from the data collected from data providers, and the profit heavily depends on the quantity and quality of the data. Hence, data providers' privacy preferences have great influence on the data collector's profit, and the profit plays an important role when the data collector decides the incentives. That is to say, the data collector's decision on incentives is related to data providers' privacy preferences. Therefore, if the data provider wants to obtain satisfactory benefits by ‘‘selling’’ his data to the data collector, he needs to consider the effect of his decision on the data collector's benefits (and even the data miner's benefits), which will in turn affect the benefits he can get from the collector. In the data-selling scenario, both the seller (i.e. the data provider) and the buyer (i.e. the data collector) want to get more benefits, so the interaction between data provider and data collector can be formally analyzed using game theory [12]. Also, the sale of data can be treated as an auction, where mechanism design theory [13] can be applied. Considering that different user roles are involved in the sale, and that the privacy-preserving methods adopted by the data collector and the data miner may influence the data provider's decisions, we will review the applications of game theory and mechanism design after the discussions of the other user roles.
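The negotiation sketched above can be pictured as a simple decision rule: the provider compares the offered price with his own privacy valuations and withholds, fuzzes, or reports the sensitive item accordingly. The thresholds and function name below are illustrative assumptions, not values or models from the literature.

```python
def choose_salary_report(offered_price: float,
                         low_threshold: float = 5.0,
                         high_threshold: float = 20.0) -> str:
    """Pick an action for the sensitive 'salary' item based on the offered price.

    The two thresholds stand in for the provider's private valuation of
    coarse-grained and fine-grained disclosure; they are assumed values.
    """
    if offered_price < low_threshold:
        return "refuse to report salary"
    if offered_price < high_threshold:
        return "report a fuzzy value, e.g. 'less than 10,000 dollars'"
    return "report the accurate salary value"


for price in (2.0, 10.0, 50.0):
    print(price, "->", choose_salary_report(price))
```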
2.2.3 PROVIDE FALSE DATA
As discussed above, a data provider can take measures to prevent the data collector from accessing his sensitive data. However, a disappointing fact that we have to admit is that, no matter how hard they try, Internet users cannot completely stop unwanted access to their personal information. So instead of trying to limit the access, the data provider can provide false information to untrustworthy data collectors. The following three methods can help an Internet user to falsify his data:
2) Using a fake identity to create phony information. In 2012, Apple Inc. was granted a patent titled ‘‘Techniques to pollute electronic profiling’’ [14] which can help to protect users' privacy. This patent discloses a method for polluting the information gathered by ‘‘network eavesdroppers’’ by creating a false online identity of a principal agent, e.g. a service subscriber. The clone identity automatically carries out numerous online actions which are quite different from the user's true activities. When a network eavesdropper collects the data of a user who is utilizing this method, the eavesdropper will be confused by the massive data created by the clone identity, and the real information about the user is buried under the manufactured phony information.
3) Using security tools to mask one's identity. When a user signs up for a web service or buys something online, he is often asked to provide information such as an email address, credit card number, phone number, etc. A browser extension called MaskMe, which was released by the online privacy company Abine, Inc. in 2013, can help the user to create and manage aliases (or ‘‘Masks’’) of such personal information. Users can use these aliases just as they normally would when such information is required, while the websites cannot get the real information. In this way, the user's privacy is protected.
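The alias idea can be sketched in a few lines: a throwaway address is generated per website and the mapping to the real address stays on the provider's side. This is an illustrative sketch of the general idea, not a description of how MaskMe itself is implemented; the class and domain names are assumptions.

```python
import secrets


class AliasVault:
    """Keep per-site aliases so websites never see the real e-mail address."""

    def __init__(self, real_email: str):
        self.real_email = real_email
        self.aliases = {}  # site -> alias handed out to that site

    def alias_for(self, site: str) -> str:
        # Reuse the alias on repeat visits; generate a fresh one otherwise
        if site not in self.aliases:
            self.aliases[site] = f"user-{secrets.token_hex(4)}@alias.example"
        return self.aliases[site]


vault = AliasVault("alice@example.com")
print(vault.alias_for("shop.example"))   # alias given to the website
print(vault.alias_for("shop.example"))   # same alias on a repeat visit
```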
CHAPTER 3
DATA COLLECTOR
3.1 CONCERNS OF DATA COLLECTOR
As shown in Fig. 1.2, a data collector collects data from data providers in order to support the subsequent data mining operations. The original data collected from data providers usually contain sensitive information about individuals. If the data collector does not take sufficient precautions before releasing the data to the public or to data miners, that sensitive information may be disclosed, even though this is not the collector's intention. For example, on October 2, 2006, the U.S. online movie rental service Netflix released a data set containing the movie ratings of 500,000 subscribers to the public for a competition called ‘‘the Netflix Prize’’. The goal of the competition was to improve the accuracy of personalized movie recommendations. The released data set was supposed to be privacy-safe, since each data record contained only a subscriber ID (unrelated to the subscriber's real identity), the movie information, the rating, and the date on which the subscriber rated the movie. However, soon after the release, two researchers [16] from the University of Texas found that with a little auxiliary information about an individual subscriber, e.g. 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 14-day error, an adversary can easily identify the individual's record (if the record is present in the data set).
From the above example we can see that it is necessary for the data collector to modify the original data before releasing them to others, so that sensitive information about data providers can neither be found in the modified data nor be inferred by anyone with malicious intent. Generally, the modification will cause a loss in data utility. The data collector should therefore also make sure that sufficient utility of the data is retained after the modification, otherwise collecting the data will be a wasted effort. The data modification process adopted by the data collector, with the goal of preserving privacy and utility simultaneously, is usually called privacy-preserving data publishing (PPDP).
Extensive approaches to PPDP have been proposed in the last decade. Fung et al. have systematically summarized and evaluated different approaches in their frequently cited survey [12]. Also, Wong and Fu have made a detailed review of studies on PPDP in their monograph [14]. To differentiate our work from theirs, in this paper we mainly focus on how PPDP is realized in two emerging applications, namely social networks and location-based services. To make our review more self-contained, in the next subsection we first briefly introduce some basics of PPDP, e.g. the privacy model, typical anonymization operations, information metrics, etc., and then we review studies on social networks and location-based services respectively.
PPDP mainly studies anonymization approaches for publishing useful data while preserving privacy. The original data is assumed to be a private table consisting of multiple records. Each record consists of the following 4 types of attributes:
• Identifier (ID): Attributes that can directly and uniquely identify an individual, such as name, ID number and mobile number.
• Quasi-identifier (QID): Attributes that can be linked with external data to re-identify individual records, such as gender, age and zip code.
• Sensitive Attribute (SA): Attributes that contain sensitive information about the individual, such as disease or salary.
• Non-sensitive Attribute (NSA): Attributes other than ID, QID and SA.
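To make these attribute types concrete, the sketch below drops the identifier of a toy table and coarsens the quasi-identifiers (age bands, zip-code prefixes) until every QID combination appears at least k times, while leaving the sensitive attribute untouched. The records and generalization rules are illustrative assumptions, not an algorithm from the surveyed literature.

```python
from collections import Counter

# (name = ID, age = QID, zip = QID, disease = SA) -- toy records
records = [
    ("Alice", 34, "47677", "flu"),
    ("Bob",   36, "47678", "cancer"),
    ("Carol", 33, "47675", "flu"),
    ("Dave",  52, "47905", "hepatitis"),
    ("Eve",   55, "47909", "flu"),
]


def generalize(rec):
    """Drop the ID and coarsen the QIDs: 10-year age bands, 3-digit zip prefix."""
    _, age, zipcode, disease = rec
    band = f"{age // 10 * 10}-{age // 10 * 10 + 9}"
    return (band, zipcode[:3] + "**", disease)


anonymized = [generalize(r) for r in records]
qid_counts = Counter((band, zip_prefix) for band, zip_prefix, _ in anonymized)

k = 2
print("k-anonymous on QIDs:", min(qid_counts.values()) >= k)
print(anonymized)
```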
Social networks have developed rapidly in recent years. Aiming at discovering interesting social patterns, social network analysis becomes more and more important. To support such analysis, the company that runs a social network application sometimes needs to publish its data to a third party. However, even if the true identifiers of individuals are removed from the published data, which is referred to as naïve anonymization, publication of the network data may lead to exposure of sensitive information about individuals, such as one's intimate relationships with others. Therefore, the network data need to be properly anonymized before they are published.
A social network is usually modeled as a graph, where vertices represent entities and edges represent the relationships between entities. Thus, PPDP in the context of social networks mainly deals with anonymizing graph data, which is much more challenging than anonymizing relational table data. Zhou et al. [14] have identified the following three challenges in social network data anonymization:
First, modeling the adversary's background knowledge about the network is much harder. For relational data tables, a small set of quasi-identifiers is used to define the attack models, whereas for network data various kinds of information, such as the attributes of an entity and the relationships between different entities, may be utilized by the adversary.
Second, measuring the information loss in anonymizing social network data is harder than in anonymizing relational data. It is difficult to quantify to what extent the anonymized network differs from the original network with respect to various properties of the network.
Third, devising anonymization methods for social network data is much harder than for relational data. Anonymizing a group of tuples in a relational table does not affect other tuples. However, when modifying a network, changing one vertex or edge may affect the rest of the network. Therefore, ‘‘divide-and-conquer’’ methods, which are widely applied to relational data, cannot be applied to network data.
To deal with the above challenges, many approaches have been proposed. According to [15], anonymization methods on simple graphs, where vertices are not associated with attributes and edges have no labels, can be classified into three categories, namely edge modification, edge randomization, and clustering-based generalization. Comprehensive surveys of approaches to social network data anonymization can be found in the literature. In this paper, we briefly review some of the very recent studies, with a focus on the following three aspects: attack model, privacy model, and data utility.
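As a toy illustration of the first two categories, the sketch below first applies naïve anonymization (replacing names with arbitrary numeric labels) and then performs a single random edge deletion and insertion; the graph, seed and perturbation amount are assumptions.

```python
import random

random.seed(7)  # fixed seed so the sketch is reproducible

# Toy friendship graph as an undirected edge set (an assumed example)
edges = {("Alice", "Bob"), ("Alice", "Carol"), ("Bob", "Carol"),
         ("Bob", "Dave"), ("Carol", "Eve")}

# Naive anonymization: replace identities with meaningless numeric labels
people = sorted({p for e in edges for p in e})
shuffled = random.sample(people, len(people))
label = {p: i for i, p in enumerate(shuffled)}
anon_edges = {tuple(sorted((label[u], label[v]))) for u, v in edges}

# Edge randomization: delete one existing edge and add one absent edge
removed = random.choice(sorted(anon_edges))
absent = [(u, v) for u in range(len(people)) for v in range(len(people))
          if u < v and (u, v) not in anon_edges]
added = random.choice(absent)
perturbed = (anon_edges - {removed}) | {added}
print("published edges:", sorted(perturbed))
```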
3.3.3 ATTACK MODEL
Given the anonymized network data, adversaries usually rely on background knowledge to de-anonymize individuals and learn the relationships between de-anonymized individuals. Zhou et al. [14] identify six types of background knowledge, i.e. attributes of vertices, vertex degrees, link relationships, neighborhoods, embedded subgraphs and graph metrics. Peng et al. [17] propose an algorithm called Seed-and-Grow to identify users from an anonymized social graph, based solely on graph structure. The algorithm first identifies a seed subgraph which is either planted by an attacker or divulged by the collusion of a small group of users, and then grows the seed based on the adversary's existing knowledge of the users' social relations. Zhu et al. design a structural attack to de-anonymize social graph data. The attack uses the cumulative degree of
FIGURE 3.2 Example of the mutual friend attack: (a) original network; (b) naïve anonymized network.
n-hop neighbors of a vertex as the regional feature, and combines it with a simulated annealing-based graph matching method to re-identify vertices in anonymized social graphs. Sun et al. introduce a relationship attack model called the mutual friend attack, which is based on the number of mutual friends of two connected individuals. Fig. 3.2 shows an example of the mutual friend attack. The original social network G with vertex identities is shown in Fig. 3.2(a), and Fig. 3.2(b) shows the corresponding anonymized network where all individuals' names are removed. In this network, only Alice and Bob have 4 mutual friends. If an adversary knows this information, then he can uniquely re-identify that the edge (D, E) in Fig. 3.2(b) corresponds to (Alice, Bob). Tai et al. investigate the friendship attack, where an adversary utilizes the degrees of two vertices connected by an edge to re-identify related victims in a published social network data set. Fig. 3.3 shows an example of the friendship attack. Suppose that each user's friend count (i.e. the degree of the vertex) is publicly available. If the adversary knows that Bob has 2 friends and Carl has 4 friends, and he also knows that Bob and Carl are friends, then he can uniquely identify that the edge (2, 3) in Fig. 3.3(b) corresponds to (Bob, Carl). Another type of attack, namely the degree attack, has also been explored. The motivation is that each individual in a social network is associated with not only a vertex identity but also a community identity, and the community identity may reflect sensitive information about the individual. It has been shown that, based on some background knowledge about vertex degrees, even if the adversary cannot precisely identify the vertex corresponding to an individual, community information and neighborhood information can still be inferred.
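The mutual friend attack can be illustrated in a few lines: compute the number of mutual friends for every edge of the anonymized graph, and any edge whose count is unique can be re-identified by an adversary who knows that count for a pair of victims. The toy graph below is an assumption, not the network of Fig. 3.2.

```python
from collections import Counter, defaultdict

# Toy anonymized undirected graph, stored as adjacency sets
adj = defaultdict(set)
for u, v in [(1, 2), (1, 3), (2, 3), (1, 4), (2, 4), (4, 5)]:
    adj[u].add(v)
    adj[v].add(u)

# Number of mutual friends (NMF) of the endpoints of every edge
nmf = {(u, v): len(adj[u] & adj[v]) for u in adj for v in adj[u] if u < v}

# An edge whose NMF value occurs only once is re-identifiable by an adversary
# who knows how many mutual friends the two victims share
counts = Counter(nmf.values())
unique_edges = [e for e, m in nmf.items() if counts[m] == 1]
print("NMF per edge:", nmf)
print("re-identifiable edges:", unique_edges)
```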
For example, the network shown in Fig. 3.5 consists of two communities, and the community identity reveals sensitive information (i.e. disease status) about its members. Suppose that an adversary knows that John has 5 friends; then he can infer that John has AIDS, even though he is not sure which of the two vertices (vertex 2 and vertex 3) in the anonymized network (Fig. 3.5(b)) corresponds to John.
FIGURE 3.5 Example of the degree attack: (a) original network; (b) naïve anonymized network.
From the above discussion we can see that graph data contain rich information that can be exploited by an adversary to initiate an attack. Modeling the background knowledge of the adversary is difficult yet very important for deriving privacy models.
3.3.4 PRIVACY MODEL
Based on the classic k-anonymity model, a number of privacy models have been proposed for graph data. Some of these models have been summarized in existing surveys, such as k-degree, k-neighborhood, k-automorphism, k-isomorphism, and k-symmetry. In order to protect the privacy of relationships from the mutual friend attack, Sun et al. introduce a variant of k-anonymity, called k-NMF anonymity. NMF is a property defined for each edge in an undirected simple graph, representing the number of mutual friends between the two individuals linked by the edge. If a network satisfies k-NMF anonymity (see Fig. 3.6), then for each edge e there will be at least k − 1 other edges with the same number of mutual friends as e. It can then be guaranteed that the probability of an edge being identified is not greater than 1/k. To prevent friendship attacks, Tai et al. [13] introduce the concept of k²-degree anonymity. A graph G¯ is k²-degree anonymous if, for every vertex with an incident edge of degree pair (d1, d2) in G¯, there exist at least k − 1 other vertices, such that each of these k − 1 vertices also has an incident edge with the same degree pair (see Fig. 8). Intuitively, if a graph is k²-degree anonymous, then the
probability of a vertex being re-identified is not greater than 1/k, even if the adversary knows a certain degree pair (dA, dB), where A and B are friends. To prevent degree attacks, Tai et al. [13] introduce the concept of structural diversity. A graph satisfies k-structural diversity anonymization (k-SDA) if, for every vertex v in the graph, there are at least k communities such that each of these communities contains at least one vertex with the same degree as v (see Fig. 3.6). In other words, for each vertex v, there are at least k − 1 other vertices located in at least k − 1 other communities.
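To make the k-NMF definition concrete, the following sketch computes the number of mutual friends for every edge of a toy graph and checks whether each count is shared by at least k edges; the graph and the value of k are assumptions, and the check simply mirrors the definition above rather than any published algorithm.

```python
from collections import Counter


def nmf_values(adj):
    """Number of mutual friends for every undirected edge of the graph."""
    return {(u, v): len(adj[u] & adj[v])
            for u in adj for v in adj[u] if u < v}


def satisfies_k_nmf(adj, k):
    """k-NMF anonymity: each mutual-friend count occurs on at least k edges."""
    counts = Counter(nmf_values(adj).values())
    return all(c >= k for c in counts.values())


# Toy anonymized graph given as adjacency sets (an assumed example)
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
print(satisfies_k_nmf(adj, 2))
```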
3.3.5 DATA UTILITY
In the context of network data anonymization, the implication of data utility is whether, and to what extent, properties of the graph are preserved. Wu et al. [15] summarize three types of properties considered in current studies. The first type is graph topological properties, which are defined for applications aiming at analyzing graph properties; various measures have been proposed to capture the structural characteristics of the network. The second type is graph spectral properties. The spectrum of a graph is usually defined as the set of eigenvalues of the graph's adjacency matrix or other derived matrices, and it has close relations with many graph characteristics. The third type is aggregate network queries. An aggregate network query calculates an aggregate on some paths or subgraphs satisfying some query conditions. The accuracy of answering aggregate network queries can be considered as a measure of utility preservation.
Most existing k-anonymization algorithms for network data publishing perform edge insertion and/or deletion operations, and they try to reduce the utility loss by minimizing the changes to the graph degree sequence. Wang et al. [13] argue that the degree sequence only captures limited structural properties of the graph and that the derived anonymization methods may cause large utility loss. They propose utility loss measurements built on community-based graph models, including both the flat community model and the hierarchical community model, to better capture the impact of anonymization on network topology.
One important characteristic of social networks is that they keep evolving over time. Sometimes the data collector needs to publish the network data periodically. The privacy issue in the sequential publishing of dynamic social network data has recently attracted researchers' attention. Medforth and Wang [14] identify a new class of privacy attack, named the degree-trail attack, arising from publishing a sequence of graph data. They demonstrate that, even if each published graph is anonymized by strong privacy-preserving techniques, an adversary with little background knowledge can re-identify the vertex belonging to a known target individual by comparing the degrees of vertices in the published graphs with the degree evolution of the target. In [15], Tai et al. adopt the same attack model used in [34] and propose a privacy model called dynamic kw-structural diversity anonymity (kw-SDA), for protecting the vertex and multi-community identities in sequential releases of a dynamic network. The parameter k has a similar implication as in the original k-anonymity model, and w denotes the length of the time period over which an adversary can monitor a target to collect the attack knowledge. They develop a heuristic algorithm for generating releases satisfying this privacy requirement.
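As a rough illustration of utility measurement, the sketch below compares the degree sequences of an original and an anonymized toy graph; the L1 distance used here is an assumed, simplistic stand-in for the degree-sequence and community-based measures discussed above.

```python
def degree_sequence(adj):
    """Sorted (descending) vertex degrees of a graph given as adjacency sets."""
    return sorted((len(nbrs) for nbrs in adj.values()), reverse=True)


def degree_sequence_distance(adj_a, adj_b):
    """L1 distance between degree sequences, a crude utility-loss measure."""
    da, db = degree_sequence(adj_a), degree_sequence(adj_b)
    da += [0] * (len(db) - len(da))   # pad the shorter sequence with zeros
    db += [0] * (len(da) - len(db))
    return sum(abs(x - y) for x, y in zip(da, db))


original = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
anonymized = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}   # edge (1, 3) removed
print(degree_sequence_distance(original, anonymized))
```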
CHAPTER 4
DATA MINER
4.1 CONCERNS OF DATA MINER
In order to discover the useful knowledge desired by the decision maker, the data miner applies data mining algorithms to the data obtained from the data collector. The privacy issues that come with the data mining operations are twofold. On one hand, if personal information can be directly observed in the data and a data breach happens, the privacy of the original data owner (i.e. the data provider) will be compromised. On the other hand, equipped with many powerful data mining techniques, the data miner is able to find out various kinds of information underlying the data, and sometimes the data mining results may reveal sensitive information about the data owners. For example, in the Target story mentioned in the Introduction, the information about the daughter's pregnancy, which was inferred by the retailer via mining customer data, is something that the daughter did not want others to know. To encourage data providers to participate in the data mining activity and provide more sensitive data, the data miner needs to make sure that the above two privacy threats are eliminated, or in other words, that data providers' privacy is well preserved. Different from existing surveys on privacy-preserving data mining (PPDM), in this paper we consider it the data collector's responsibility to ensure that sensitive raw data are modified or trimmed out from the published data. The primary concern of the data miner is thus how to prevent sensitive information from appearing in the mining results. To perform privacy-preserving data mining, the data miner usually needs to modify the data he obtained from the data collector, and as a result a decline in data utility is inevitable. Similar to the data collector, the data miner also faces a privacy-utility trade-off problem, but in the context of PPDM, the quantifications of privacy and utility are closely related to the mining algorithm employed by the data miner.
4.2 APPROACHES TO PRIVACY PROTECTION
Extensive PPDM approaches have been proposed (see [5]–[7] for detailed surveys). These approaches can be classified by different criteria [13], such as data distribution, data modification method, data mining algorithm, etc. Based on the distribution of the data, PPDM approaches can be classified into two categories, namely approaches for centralized data mining and approaches for distributed data mining. Distributed data mining can be further categorized into data mining over horizontally partitioned data and data mining over vertically partitioned data. Based on the technique adopted for data modification, PPDM can be classified into perturbation-based, blocking-based, swapping-based approaches, etc. Since we define the privacy-preserving goal of the data miner as preventing sensitive information from being revealed by the data mining results, in this section we classify PPDM approaches according to the type of data mining task. Specifically, we review recent studies on privacy-preserving association rule mining, privacy-preserving classification, and privacy-preserving clustering, respectively.
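As an illustration of the perturbation-based family mentioned above, the sketch below adds zero-mean Gaussian noise to a numeric attribute before release; aggregate statistics remain roughly recoverable while individual values are masked. The data and noise scale are assumptions, and this is a generic randomization sketch, not a specific method from the surveyed literature.

```python
import random
import statistics

random.seed(0)
true_ages = [23, 35, 41, 29, 52, 47, 38, 31]   # sensitive raw values (toy data)
noise_sd = 10.0                                 # assumed perturbation scale

# Provider/collector side: perturb each value before it is released for mining
released = [a + random.gauss(0.0, noise_sd) for a in true_ages]

# Miner side: aggregate statistics are still approximately recoverable,
# because the added noise has zero mean
print("true mean  ", statistics.mean(true_ages))
print("noisy mean ", round(statistics.mean(released), 1))
```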
Since many of these studies deal with distributed data mining, where secure multi-party computation is widely applied, here we give a brief introduction to secure multi-party computation (SMC). SMC is a subfield of cryptography. In general, SMC assumes a number of participants P1, P2, . . . , Pm, each holding private data X1, X2, . . . , Xm, respectively. The participants want to compute the value of a public function f of m variables at the point (X1, X2, . . . , Xm). An SMC protocol is called secure if, at the end of the computation, no participant knows anything except his own data and the result of the global calculation. We can view this by imagining a trusted third party (TTP): every participant gives his input to the TTP, and the TTP performs the computation and sends the results back to the participants. By employing an SMC protocol, the same result can be achieved without the TTP. In the context of distributed data mining, the goal of SMC is to make sure that each participant can get the correct data mining result without revealing his data to others.
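The idea of SMC can be conveyed with the classic secure-sum construction based on additive secret sharing: each party splits its private value into random shares so that the global sum can be computed without any single party (or a trusted third party) learning another's input. This is a textbook-style sketch under an honest-but-curious assumption, simulated in one process; it is not a protocol drawn from the works surveyed here.

```python
import random


def secure_sum(private_values, modulus=1_000_003):
    """Sum the private inputs using additive secret sharing (simulated locally)."""
    m = len(private_values)
    # Each party i creates m random shares of its value
    shares = [[random.randrange(modulus) for _ in range(m)] for _ in private_values]
    # Fix the last share so that each party's shares sum to its value mod modulus
    for i, value in enumerate(private_values):
        shares[i][-1] = (value - sum(shares[i][:-1])) % modulus
    # Party j only ever sees the j-th share from every other party
    partial_sums = [sum(shares[i][j] for i in range(m)) % modulus for j in range(m)]
    # Publishing the partial sums reveals only the global total
    return sum(partial_sums) % modulus


print(secure_sum([12, 7, 30]))   # 49, with no party revealing its own input
```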
CHAPTER 5
DECISION MAKER
5.1 CONCERNS OF DECISION MAKER
The ultimate goal of data mining is to provide useful information to the decision maker, so that the decision maker can choose a better way to achieve his objective, such as increasing the sales of products or making correct diagnoses of diseases. At first glance, it seems that the decision maker has no responsibility for protecting privacy, since we usually interpret privacy as sensitive information about the original data owners (i.e. data providers), and generally the data miner, the data collector and the data provider himself are considered responsible for protecting it. However, if we look at the privacy issue from a wider perspective, we can see that the decision maker also has his own privacy concerns. The data mining results provided by the data miner are of high importance to the decision maker. If the results are disclosed to someone else, e.g. a competing company, the decision maker may suffer a loss; that is to say, from the perspective of the decision maker, the data mining results are sensitive information. On the other hand, if the decision maker does not get the data mining results directly from the data miner but from someone else, whom we call an information transmitter, the decision maker should be skeptical about the credibility of the results, in case the results have been distorted. Therefore, the privacy concerns of the decision maker are twofold: how to prevent unwanted disclosure of sensitive mining results, and how to evaluate the credibility of the received mining results.
To deal with the first privacy issue raised above, i.e. to prevent unwanted disclosure of sensitive mining results, the decision maker usually has to resort to legal measures, for example, making a contract with the data miner that forbids the miner from disclosing the mining results to a third party. To handle the second issue, i.e. to determine whether the received information can be trusted, the decision maker can utilize methodologies from data provenance, credibility analysis of web information, or other related research fields. In the rest of this section, we first briefly review the studies on data provenance and web information credibility, and then present a preliminary discussion about how these studies can help to analyze the credibility of data mining results.
If the decision maker does not get the data mining results directly from the data miner, he would want to know how the results were delivered to him and what kind of modification may have been applied to them, so that he can determine whether the results can be trusted. This is why ‘‘provenance’’ is needed. The term provenance originally refers to the chronology of the ownership, custody or location of a historical object. In information science, a piece of data is treated as the historical object, and data provenance refers to the information that helps determine the derivation history of the data, starting from the original source [18]. Two kinds of information can be found in the provenance of the data: the ancestral data from which the current data evolved, and the transformations applied to the ancestral data that helped to produce the current data. With such information, people can better understand the data and judge their credibility.
Since the 1990s, data provenance has been extensively studied in the fields of databases and workflows, and several surveys are now available. In [18], Simmhan et al. present a taxonomy of data provenance techniques. The following five aspects are used to capture the characteristics of a provenance system:
• Representation of provenance. There are mainly two types of methods to represent provenance information: one is annotation and the other is inversion. The annotation method uses metadata, which comprise the derivation history of the data, as annotations and descriptions of the source data and processes. The inversion method uses the property by which some derivations can be inverted to find the input data supplied to derive the output data.
• Provenance storage. Provenance can be tightly coupled to the data it describes and located in the same data storage system, or even be embedded within the data file. Alternatively, provenance can be stored separately, together with other metadata or simply by itself.
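The annotation method can be pictured as a small metadata record that travels with each derived data item, naming its ancestral sources and the transformation that produced it. The record layout below is an illustrative assumption, not a format prescribed by the surveyed systems.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ProvenanceRecord:
    """Annotation-style provenance: what the item was derived from, and how."""
    item_id: str
    sources: list            # ancestral data items
    transformation: str      # process that produced this item
    produced_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())


record = ProvenanceRecord(
    item_id="sales_summary_2024",
    sources=["raw_sales_table", "region_lookup"],
    transformation="aggregate monthly totals per region",
)
print(record)
```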
In [12], Glavic et al. present another categorization scheme for provenance systems. The proposed scheme consists of three main categories: provenance model; query and manipulation functionality; and storage model and recording strategy. Davidson and Freire review studies on provenance for scientific workflows. They summarize the key components of a provenance management solution, discuss applications of workflow provenance, and outline a few open problems for database-related research.
Barbier and Liu study the information provenance problem in social media. They model the social network as a directed graph G(V, E, p), where V is the node set and E is the edge set. Each node in the graph represents an entity and each directed edge represents the direction of information propagation. An information propagation probability p is attached to each edge. Based on this model, they define the information provenance problem as follows: given a directed graph G(V, E, p), with known terminals T ⊆ V and a positive integer constant k ∈ Z+, identify the sources S ⊆ V such that |S| ≤ k and U(S, T) is maximized. The function U(S, T) estimates the utility of information propagation that starts from the sources S and stops at the terminals T. To solve this provenance problem, one can leverage the unique features of social networks, e.g. user profiles, user interactions, spatial or temporal information, etc. Two approaches are developed to seek the provenance of information: one utilizes the network information to directly seek the provenance of information, and the other aims at finding the reverse flows of information propagation.
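To give a feel for the provenance problem defined above, the sketch below estimates U(S, T) by Monte Carlo simulation of edge-by-edge propagation and then picks the best single source for a toy graph. The graph, probabilities and simulation-based estimate are all assumptions for illustration, not the approaches of Barbier and Liu.

```python
import random

random.seed(1)

# Directed propagation graph: edge -> propagation probability (toy values)
edges = {("a", "b"): 0.6, ("b", "c"): 0.5, ("a", "d"): 0.3,
         ("d", "c"): 0.7, ("e", "c"): 0.2}
nodes = {n for e in edges for n in e}
terminals = {"c"}


def utility(sources, trials=2000):
    """Monte Carlo estimate of U(S, T): how often information from S reaches T."""
    hits = 0
    for _ in range(trials):
        active = set(sources)
        frontier = set(sources)
        while frontier:
            nxt = set()
            for (u, v), p in edges.items():
                if u in frontier and v not in active and random.random() < p:
                    nxt.add(v)
            active |= nxt
            frontier = nxt
        if terminals <= active:
            hits += 1
    return hits / trials


# Pick the single best source (|S| = 1) by brute force over candidate nodes
candidates = [frozenset({n}) for n in nodes - terminals]
best = max(candidates, key=utility)
print("estimated source:", set(best), "U =", round(utility(best), 2))
```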
The special characteristics of the Internet, such as openness, freedom and anonymity, pose great challenges for seeking the provenance of information. Compared to the approaches developed in the context of databases and workflows, the current solutions proposed for supporting provenance in the Internet environment are less mature. There are still many problems to be explored in future studies.
Because of the lack of publishing barriers, the low cost of dissemination, and the lax control of quality, the credibility of web information has become a serious issue. Tudjman et al. [17] identify the following five criteria that can be employed by Internet users to differentiate false information from the truth:
• Accuracy: false information does not contain accurate data or approved facts.
• Currency: for false information, the data about its source, time and place of its origin are incomplete, out of date, or missing.
In [98], Metzger summarizes the skills that can help users assess the credibility of online information.
With the rapid growth of online social media, false information breeds more easily and spreads more widely than before, which further increases the difficulty of judging information credibility. Identifying rumors and their sources in microblogging networks has recently become a hot research topic. Current research usually treats rumor identification as a classification problem, which involves the following two issues:
• Preparation of the training data set. Current studies usually take rumors that have been confirmed by authorities as positive training samples. Considering the huge number of messages in microblogging networks, such training samples are far from enough to train a good classifier. Building a large benchmark data set of rumors is urgently needed.
• Feature selection. Various kinds of features can be used to characterize microblogging messages. In the current literature, the following three types of features are often used: content-based features, such as word unigrams/bigrams, part-of-speech unigrams/bigrams, text length, number of sentiment words (positive/negative), number of URLs, and number of hashtags; user-related features, such as registration time, registration location, number of friends, number of followers, and number of messages posted by the user; and network features, such as number of comments and number of retweets. A simple sketch of such a feature vector is given below.
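The sketch below assembles a small numeric feature vector from a single message, grouping content-based, user-related and network features; the message fields and the particular features chosen are illustrative assumptions, and any standard classifier could then be trained on vectors like these.

```python
import re


def extract_features(message: dict) -> list:
    """Build a feature vector: content-based, user-related, network features."""
    text = message["text"]
    content = [
        len(text),                                # text length
        len(re.findall(r"https?://\S+", text)),   # number of URLs
        text.count("#"),                          # number of hashtags
    ]
    user = [
        message["user_followers"],
        message["user_friends"],
        message["user_account_age_days"],
    ]
    network = [
        message["num_comments"],
        message["num_retweets"],
    ]
    return content + user + network


msg = {"text": "Breaking: free gadgets! http://example.com #news",
       "user_followers": 15, "user_friends": 200,
       "user_account_age_days": 3, "num_comments": 1, "num_retweets": 950}
print(extract_features(msg))
```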
CONCLUSION
We review the privacy issues related to data mining by using a user-role based
methodology. We differentiate four different user roles that are commonly involved in
data mining applications, i.e. data provider, data collector, data miner and decision
maker. Each user role has its own privacy concerns; hence the privacy-preserving approaches adopted by one user role are generally different from those adopted by others:
• For the data collector, the privacy-preserving objective is to release useful data to data miners without disclosing data providers' identities and sensitive information about them. To achieve this goal, he needs to develop proper privacy models to quantify the possible loss of privacy under different attacks, and to apply anonymization techniques to the data.
• For the data miner, the privacy-preserving objective is to get correct data mining results while keeping sensitive information undisclosed either in the process of data mining or in the mining results. To achieve this goal, he can choose a proper method to modify the data before certain mining algorithms are applied, or utilize secure computation protocols to ensure the safety of the private data and of the sensitive information contained in the learned model.
To achieve the privacy-preserving goals of the different user roles, various methods from different research fields are required. We have reviewed recent progress in the related studies, and discussed problems awaiting further investigation. We hope that the review presented in this paper can offer researchers different insights into the issue of privacy-preserving data mining, and promote the exploration of new solutions for the security of sensitive information.
REFERENCES
[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. San
Mateo, CA, USA: Morgan Kaufmann, 2006.
ery and data mining,’’ in Proc. Austral. Inst. Comput. Ethics Conf., 1999, pp. 89–99.
Preserving Data Mining Models and Algorithms. New York, NY, USA: Springer-Verlag,
2008.
techniques: Current scenario and future prospects,’’ in Proc. 3rd Int. Conf. Comput.
Commun. Technol. (ICCCT), Nov. 2012, pp. 26–32.
[9] V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati, ‘‘Microdata
protection,’’ in Secure Data Management in Decentralized Systems. New York, NY,
USA: Springer-Verlag, 2007, pp. 291–321.
[10] O. Tene and J. Polenetsky, ‘‘To track or ‘do not track’: Advancing
Wheatsheaf, 1992.
sparse datasets,’’ in Proc. IEEE Symp. Secur. Privacy (SP), May 2008, pp. 111–125.
data publishing: A survey of recent developments,’’ ACM Comput. Surv., vol. 42, no. 4,
Jun. 2010, Art. ID 14.
ing: An overview,’’ Synthesis Lectures Data Manage., vol. 2, no. 1, pp. 1–138, 2010.