
CHAPTER 1

INTRODUCTION
Data mining has attracted more and more attention in recent years, probably because of
the popularity of the ‘‘big data’’ concept. Data mining is the process of discovering
interesting patterns and knowledge from large amounts of data [1]. As a highly
application-driven discipline, data mining has been successfully applied to many
domains, such as business intelligence, Web search, scientific discovery, digital libraries,
etc.

1.1 THE PROCESS OF KDD

The term ‘‘data mining’’ is often treated as a synonym for another term, ‘‘knowledge discovery from data’’ (KDD), which highlights the goal of the mining process. To obtain
useful knowledge from data, the following steps are performed in an iterative way (see
Fig. 1.1):

• Step 1: Data preprocessing. Basic operations include data selection (to retrieve data relevant to the KDD task from the database), data cleaning (to remove noise and inconsistent data, to handle missing data fields, etc.) and data integration (to combine data from multiple sources).

• Step 2: Data transformation. The goal is to transform data into forms appropriate for the mining task, that is, to find useful features to represent the data. Feature selection and feature transformation are basic operations.

• Step 3: Data mining. This is an essential process where intelligent methods are employed to extract data patterns (e.g. association rules, clusters, classification rules, etc.).

FIGURE 1.1 An overview of the KDD process.
• Step 4: Pattern evaluation and presentation. Basic operations include identifying the truly interesting patterns which represent knowledge, and presenting the mined knowledge in an easy-to-understand fashion.
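To make the four steps above more concrete, the following toy sketch runs a miniature KDD pipeline in Python. The synthetic table, the choice of clustering as the mining task, and the silhouette score as the evaluation metric are illustrative assumptions, not prescribed by the text.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Step 1: data preprocessing -- select relevant columns and clean missing values.
raw = pd.DataFrame({
    "age":    [23, 31, 45, None, 52, 36],
    "income": [28_000, 42_000, 61_000, 39_000, 75_000, None],
})
clean = raw.dropna()

# Step 2: data transformation -- scale the features into a mining-friendly form.
features = StandardScaler().fit_transform(clean[["age", "income"]])

# Step 3: data mining -- here, a simple clustering of the records.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# Step 4: pattern evaluation and presentation -- score the discovered clusters.
print("cluster labels:", model.labels_)
print("silhouette score:", silhouette_score(features, model.labels_))
```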

1.2 THE PRIVACY CONCERN AND PPDM

Although the information discovered by data mining can be very valuable to many applications, people have shown increasing concern about the other side of the coin, namely the privacy threats posed by data mining [2]. An individual's privacy may be violated due to unauthorized access to personal data, the undesired discovery of one's embarrassing information, the use of personal data for purposes other than the one for which the data were collected, etc. For instance, the U.S. retailer Target once received complaints from a customer who was angry that Target had sent coupons for baby clothes to his teenage daughter. However, the daughter was in fact pregnant at that time, and Target had correctly inferred this by mining its customer data. From this story we can see that the conflict between data mining and privacy does exist. To deal with the privacy issues in data mining, a subfield of data mining referred to as privacy-preserving data mining (PPDM) has developed rapidly in recent years. The objective of PPDM is to safeguard sensitive information from unsolicited or unsanctioned disclosure while preserving the utility of the data. The consideration of PPDM is twofold. First, sensitive raw data, such as an individual's ID card number and cell phone number, should not be directly used for mining. Second, sensitive mining results whose disclosure would result in privacy violations should be excluded. After the pioneering work of Agrawal and Srikant [3] and Lindell and Pinkas [4], numerous studies on PPDM have been conducted.

1.3 USER ROLE-BASED METHODOLOGY

Current models and algorithms proposed for PPDM mainly focus on how to hide sensitive information from certain mining operations. However, as depicted in Fig. 1.1, the whole KDD process involves multi-phase operations. Besides the mining phase, privacy issues may also arise in the data collection and data preprocessing phases, and even in the delivery of the mining results. In this paper, we investigate the privacy aspects of data mining by considering the whole knowledge-discovery process. We present an overview of the many approaches that can help to make proper use of sensitive data and to protect sensitive information discovered by data mining. We use the term ‘‘sensitive information’’ to refer to privileged or proprietary information that only certain people are allowed to see and that is therefore not accessible to everyone. If sensitive information is lost or used in any way other than intended, the result can be severe damage to the person or organization to which that information belongs. The term ‘‘sensitive data’’ refers to data from which sensitive information can be extracted. Throughout the paper, we consider the two terms ‘‘privacy’’ and ‘‘sensitive information’’ to be interchangeable.

We develop a user-role based methodology to conduct the review of related studies.


Based on the stage division of the KDD process (see Fig. 1.1), we can identify four different types of users, namely four user roles, in a typical data mining scenario (see Fig. 1.2):

• Data Provider: the user who owns some data that are desired by the data mining task.

• Data Collector: the user who collects data from data providers and then publishes the data to the data miner.

• Data Miner: the user who performs data mining tasks on the data.

• Decision Maker: the user who makes decisions based on the data mining results in order
to achieve certain goals.

In the data mining scenario depicted in Fig. 1.2, a user represents either a person or an organization. Also, one user can play multiple roles at once. For example, in the Target story we mentioned above, the customer plays the role of data provider, and the retailer plays the roles of data collector, data miner and decision maker.

FIGURE 1.2. Application scenario with data mining at the core.

By differentiating the four user roles, we can explore the privacy issues in data mining in a principled way. All users care about the security of sensitive information, but each user role views the security issue from its own perspective. What we need to do is to identify the privacy problems that each user role is concerned about, and to find appropriate solutions to these problems. Here we briefly describe the privacy concerns of each user role. Detailed discussions will be presented in the following sections.

1.3.1 DATA PROVIDER

The major concern of a data provider is whether he can control the sensitivity of the data he provides to others. On one hand, the provider should be able to make his very private data, namely the data containing information that he does not want anyone else to know, inaccessible to the data collector. On the other hand, if the provider has to provide some data to the data collector, he wants to hide his sensitive information as much as possible and to get sufficient compensation for the possible loss in privacy.

1.3.2 DATA COLLECTOR

The data collected from data providers may contain individuals' sensitive information. Directly releasing the data to the data miner will violate the data providers' privacy, hence data modification is required. On the other hand, the data should still be useful after modification, otherwise collecting the data will be meaningless. Therefore, the major concern of the data collector is to guarantee that the modified data contain no sensitive information but still preserve high utility.

1.3.3 DATA MINER

The data miner applies mining algorithms to the data provided by the data collector, and he wishes to extract useful information from the data in a privacy-preserving manner. As introduced in Section 1.2, PPDM covers two types of protection, namely the protection of the sensitive data themselves and the protection of sensitive mining results. With the user role-based methodology proposed in this paper, we consider that the data collector should take the major responsibility for protecting sensitive data, while the data miner can focus on how to hide the sensitive mining results from untrusted parties.

1.3.4 DECISION MAKER

As shown in Fig. 1.2, a decision maker can get the data mining results directly from the data miner, or from some information transmitter. It is possible that the information transmitter changes the mining results intentionally or unintentionally, which may cause serious loss to the decision maker. Therefore, what the decision maker is concerned about is whether the mining results are credible [4].

In addition to investigating the privacy-protection approaches adopted by each user role, in this paper we emphasize a common type of approach, namely the game theoretic approach, that can be applied to many problems involving privacy protection in data mining. The rationale is that, in the data mining scenario, each user pursues his own interests in terms of privacy preservation or data utility, and the interests of different users are correlated. Hence the interactions among different users can be modeled as a game. By using methodologies from game theory [8], we can obtain useful implications on how each user role should behave in an attempt to solve his privacy problems.

1.4 PAPER ORGANIZATION

The remainder of this paper is organized as follows. The following chapters discuss the privacy problems faced by the data provider, data collector, data miner and decision maker, respectively, together with approaches to these problems. Studies of game theoretic approaches in the context of privacy-preserving data mining are also reviewed, along with some non-technical issues related to sensitive information protection.

CHAPTER 2

DATA PROVIDER
2.1 CONCERNS OF DATA PROVIDER

A data provider owns some data from which valuable information can be extracted. In the data mining scenario depicted in Fig. 1.2, there are actually two types of data providers: one refers to the data provider who provides data to the data collector, and the other refers to the data collector who provides data to the data miner. To differentiate the privacy-protecting methods adopted by different user roles, in this section we restrict ourselves to the ordinary data provider, the one who owns a relatively small amount of data which contain only information about himself. Data reporting information about an individual are often referred to as ‘‘microdata’’ [5]. If a data provider reveals his microdata to the data collector, his privacy might be compromised due to an unexpected data breach or the exposure of sensitive information. Hence, the privacy concern of a data provider is whether he can take control over what kind of and how much information other people can obtain from his data. To investigate the measures that the data provider can adopt to protect privacy, we consider the following three situations:

1) If the data provider considers his data to be very sensitive, that is, the data may reveal some information that he does not want anyone else to know, the provider can simply refuse to provide such data. Effective access-control measures are desired by the data provider, so that he can prevent his sensitive data from being stolen by the data collector.

2) Realizing that his data are valuable to the data collector (as well as the data miner), the data provider may be willing to hand over some of his private data in exchange for certain benefits, such as better services or monetary rewards. The data provider needs to know how to negotiate with the data collector, so that he will get sufficient compensation for any possible loss in privacy.

3) If the data provider can neither prevent access to his sensitive data nor make a lucrative deal with the data collector, he can distort the data that will be fetched by the data collector, so that his true information cannot be easily disclosed.

2.2 APPROACHES TO PRIVACY PROTECTION

2.2.1 LIMIT THE ACCESS

A data provider provides his data to the collector in either an active way or a passive way. By ‘‘active’’ we mean that the data provider voluntarily opts into a survey initiated by the data collector, or fills in registration forms to create an account on a website. By ‘‘passive’’ we mean that the data, which are generated by the provider's routine activities, are recorded by the data collector, while the data provider may not even be aware of the disclosure of his data. When the data provider provides his data actively, he can simply ignore the collector's demand for the information that he deems very sensitive. If his data are passively provided to the data collector, the data provider can take some measures to limit the collector's access to his sensitive data.

Suppose that the data provider is an Internet user who is afraid that his online activities may expose his privacy. To protect his privacy, the user can try to erase the traces of his online activities by emptying the browser's cache, deleting cookies, clearing usage records of applications, etc. The provider can also utilize various security tools developed for the Internet environment to protect his data. Many of these security tools are designed as browser extensions for ease of use. Based on their basic functions, current security tools can be categorized into the following three types:

1) Anti-tracking extensions. Knowing that valuable information can be extracted from the data produced by users' online activities, Internet companies have a strong motivation to track the users' movements on the Internet. When browsing the Internet, a user can utilize an anti-tracking extension to block trackers from collecting cookies. Popular anti-tracking extensions include Disconnect, Do Not Track Me, Ghostery, etc. A major technology used for anti-tracking is called Do Not Track (DNT) [10], which enables users to opt out of tracking by websites they do not visit. A user's opt-out preference is signaled by an HTTP header field named DNT: if DNT = 1, the user does not want to be tracked (opt out). Two U.S. researchers first created a prototype add-on supporting the DNT header for the Firefox web browser in 2009, and later many web browsers added support for DNT. DNT is not only a technology but also a policy framework for how companies that receive the signal should respond. The W3C Tracking Protection Working Group [11] is now trying to standardize how websites should respond to users' DNT requests. (A minimal illustration of the DNT header is sketched after this list.)

2) Advertisement and script blockers. This type of browser extension can block advertisements on websites, and kill scripts and widgets that send the user's data to some unknown third party. Example tools include Adblock Plus, NoScript, FlashBlock, etc.

3) Encryption tools. To make sure a private online communication between two parties cannot be intercepted by third parties, a user can utilize encryption tools, such as MailCloak and TorChat, to encrypt his emails, instant messages, or other types of web traffic. A user can also encrypt all of his Internet traffic by using a VPN (virtual private network) service.
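As a concrete illustration of the DNT mechanism described in item 1 above, the sketch below shows how a client can attach the opt-out header to an ordinary HTTP request and how a cooperating site might check it. The URL is a placeholder and the server-side helper is hypothetical; how the signal is actually honored is up to each website, as the W3C draft notes.

```python
import requests

# The client signals its tracking preference with the DNT header:
# "DNT: 1" means the user opts out of tracking.
response = requests.get(
    "https://example.com/page",   # placeholder URL, not a real endpoint
    headers={"DNT": "1"},          # the opt-out signal
)

def honors_dnt(request_headers: dict) -> bool:
    """Hypothetical server-side check: True if the request carries an opt-out signal."""
    return request_headers.get("DNT") == "1"
```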

In addition to the tools mentioned above, an Internet user should always use anti-virus and anti-malware tools to protect the data stored on digital equipment such as personal computers, cell phones and tablets. With the help of these security tools, the data provider can limit others' access to his personal data. Though there is no guarantee that one's sensitive data can be completely kept out of the reach of untrustworthy data collectors, making a habit of clearing online traces and using security tools does help to reduce the risk of privacy disclosure.

2.2.2 TRADE PRIVACY FOR BENEFIT

In some cases, the data provider needs to make a trade-off between the loss of privacy and the benefits brought by participating in data mining. For example, by analyzing a user's demographic information and browsing history, a shopping website can offer personalized product recommendations to the user. The user's sensitive preferences may be disclosed, but he can enjoy a better shopping experience. Driven by some benefit, e.g. a personalized service or monetary incentives, the data provider may be willing to provide his sensitive data to a trustworthy data collector, who promises that the provider's sensitive information will not be revealed to an unauthorized third party. If the provider is able to predict how much benefit he can get, he can rationally decide what kind of and how much sensitive data to provide. For example, suppose a data collector asks the data provider to provide information about his age, gender, occupation and annual salary, and tells the data provider how much he would pay for each data item. If the data provider considers salary to be his sensitive information, then based on the prices offered by the collector, he chooses one of the following actions:

i) not to report his salary, if he thinks the price is too low;

ii) to report a fuzzy value of his salary, e.g. ‘‘less than 10,000 dollars’’, if he thinks the price is just acceptable;

iii) to report an accurate value of his salary, if he thinks the price is high enough (a toy codification of this decision rule is sketched below).

From this example we can see that both the privacy preference of the data provider and the incentives offered by the data collector affect the data provider's decision about his sensitive data. On the other hand, the data collector can make a profit from the data collected from data providers, and the profit heavily depends on the quantity and quality of the data. Hence, data providers' privacy preferences have a great influence on the data collector's profit, and the profit plays an important role when the data collector decides the incentives. That is to say, the data collector's decision on incentives is related to data providers' privacy preferences. Therefore, if the data provider wants to obtain satisfying benefits by ‘‘selling’’ his data to the data collector, he needs to consider the effect of his decision on the data collector's benefits (and even the data miner's benefits), which will in turn affect the benefits he can get from the collector. In the data-selling scenario, both the seller (i.e. the data provider) and the buyer (i.e. the data collector) want to get more benefits, thus the interaction between data provider and data collector can be formally analyzed by using game theory [12]. Also, the sale of data can be treated as an auction, where mechanism design theory [13] can be applied. Considering that different user roles are involved in the sale, and that the privacy-preserving methods adopted by the data collector and data miner may influence the data provider's decisions, we will review the applications of game theory and mechanism design after the discussions of the other user roles.
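The three-way choice above can be written down as a tiny decision rule. The sketch below merely codifies the example in the text; the price thresholds are hypothetical parameters standing in for the provider's privacy preference.

```python
def salary_report(offered_price: float,
                  low_threshold: float,
                  high_threshold: float,
                  true_salary: float):
    """Toy decision rule for the salary example above (thresholds are hypothetical)."""
    if offered_price < low_threshold:
        return None                                   # action i): refuse to report
    elif offered_price < high_threshold:
        # action ii): report only a fuzzy value
        return "less than 10,000 dollars" if true_salary < 10_000 else "10,000 dollars or more"
    else:
        return true_salary                            # action iii): report the accurate value

# Example: a mid-range offer yields only a fuzzy answer.
print(salary_report(offered_price=5, low_threshold=2, high_threshold=20, true_salary=8_500))
```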

2.2.3 PROVIDE FALSE DATA

As discussed above, a data provider can take measures to prevent the data collector from accessing his sensitive data. However, a disappointing fact that we have to admit is that, no matter how hard they try, Internet users cannot completely stop unwanted access to their personal information. So instead of trying to limit the access, the data provider can provide false information to untrustworthy data collectors. The following three methods can help an Internet user to falsify his data:

1) Using ‘‘sock puppets’’ to hide one's true activities. A sock puppet is a false online identity through which a member of an Internet community speaks while pretending to be another person, like a puppeteer manipulating a hand puppet. By using multiple sock puppets, the data produced by one individual's activities will be deemed as data belonging to different individuals, assuming that the data collector does not have enough knowledge to relate different sock puppets to one specific individual. As a result, the user's true activities are unknown to others and his sensitive information (e.g. political preference) cannot be easily discovered.

2) Using a fake identity to create phony information. In 2012, Apple Inc. was assigned a patent titled ‘‘Techniques to pollute electronic profiling’’ [14], which can help to protect users' privacy. This patent discloses a method for polluting the information gathered by ‘‘network eavesdroppers’’ by creating a false online identity (a clone) of a principal agent, e.g. a service subscriber. The clone identity automatically carries out numerous online actions which are quite different from the user's true activities. When a network eavesdropper collects the data of a user who is utilizing this method, the eavesdropper is swamped by the massive data created by the clone identity. Real information about the user is buried under the manufactured phony information.

3) Using security tools to mask one's identity. When a user signs up for a web service or buys something online, he is often asked to provide information such as an email address, credit card number, phone number, etc. A browser extension called MaskMe, which was released by the online privacy company Abine, Inc. in 2013, can help the user to create and manage aliases (or Masks) of this personal information. Users can use these aliases just as they normally would when such information is required, while the websites cannot get the real information. In this way, the user's privacy is protected.

CHAPTER 3

DATA COLLECTOR
3.1 CONCERNS OF DATA COLLECTOR

As shown in Fig. 1.2, a data collector collects data from data providers in order to support the subsequent data mining operations. The original data collected from data providers usually contain sensitive information about individuals. If the data collector does not take sufficient precautions before releasing the data to the public or to data miners, that sensitive information may be disclosed, even though this is not the collector's intention. For example, on October 2, 2006, the U.S. online movie rental service Netflix released a data set containing the movie ratings of 500,000 subscribers to the public for a challenging competition called ‘‘the Netflix Prize’’. The goal of the competition was to improve the accuracy of personalized movie recommendations. The released data set was supposed to be privacy-safe, since each data record only contained a subscriber ID (irrelevant to the subscriber's real identity), the movie info, the rating, and the date on which the subscriber rated the movie. However, soon after the release, two researchers [16] from the University of Texas found that with a little bit of auxiliary information about an individual subscriber, e.g. 8 movie ratings (of which 2 may be completely wrong) and dates that may have a 14-day error, an adversary can easily identify the individual's record (if the record is present in the data set).

From the above example we can see that it is necessary for the data collector to modify the original data before releasing them to others, so that sensitive information about data providers can neither be found in the modified data nor be inferred by anyone with malicious intent. Generally, the modification will cause a loss in data utility. The data collector should also make sure that sufficient utility of the data can be retained after the modification, otherwise collecting the data will be a wasted effort. The data modification process adopted by the data collector, with the goal of preserving privacy and utility simultaneously, is usually called privacy-preserving data publishing (PPDP).

Extensive approaches to PPDP have been proposed in the last decade. Fung et al. have systematically summarized and evaluated different approaches in their frequently cited survey [17]. Also, Wong and Fu have made a detailed review of studies on PPDP in their monograph [18]. To differentiate from their work, in this paper we mainly focus on how PPDP is realized in two emerging applications, namely social networks and location-based services. To make our review more self-contained, in the next subsection we first briefly introduce some basics of PPDP, e.g. the privacy model, typical anonymization operations, information metrics, etc., and then we review studies on social networks and location-based services respectively.

3.2 APPROACHES TO PRIVACY PROTECTION

3.2.1 BASICS OF PPDP

PPDP mainly studies anonymization approaches for publishing useful data while preserving privacy. The original data is assumed to be a private table consisting of multiple records. Each record consists of the following 4 types of attributes:

• Identifier (ID): Attributes that can directly and uniquely identify an individual, such as name, ID number and mobile number.

• Quasi-identifier (QID): Attributes that can be linked with external data to re-identify individual records, such as gender, age and zip code.

• Sensitive Attribute (SA): Attributes that an individual wants to conceal, such as disease and salary.

• Non-sensitive Attribute (NSA): Attributes other than ID, QID and SA.
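A toy example may help to fix these attribute types. The sketch below builds a small private table and applies two typical pre-publication operations, suppressing the identifier and generalizing the quasi-identifiers. The concrete table and generalization rules are illustrative assumptions, not a prescribed PPDP algorithm.

```python
import pandas as pd

# Toy "private table": an identifier, two quasi-identifiers, and a sensitive attribute.
records = pd.DataFrame({
    "name":    ["Alice", "Bob", "Carol"],        # Identifier (ID)
    "age":     [23, 27, 35],                      # Quasi-identifier (QID)
    "zipcode": ["47677", "47602", "47905"],       # Quasi-identifier (QID)
    "disease": ["flu", "HIV", "flu"],             # Sensitive Attribute (SA)
})

# Typical anonymization operations before publishing:
published = records.drop(columns=["name"])                    # suppress the ID
published["age"] = (published["age"] // 10) * 10              # generalize age to its decade
published["zipcode"] = published["zipcode"].str[:3] + "**"    # generalize the zip code

print(published)
```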

3.2.2 PRIVACY-PRESERVING PUBLISHING OF SOCIAL NETWORK DATA

Social networks have developed rapidly in recent years. Aiming at discovering interesting social patterns, social network analysis has become more and more important. To support such analysis, the company that runs a social network application sometimes needs to publish its data to a third party. However, even if the truthful identifiers of individuals are removed from the published data, which is referred to as naïve anonymization, publication of the network data may lead to the exposure of sensitive information about individuals, such as one's intimate relationships with others. Therefore, the network data need to be properly anonymized before they are published.

A social network is usually modeled as a graph, where a vertex represents an entity and an edge represents the relationship between two entities. Thus, PPDP in the context of social networks mainly deals with anonymizing graph data, which is much more challenging than anonymizing relational table data. Zhou et al. [14] have identified the following three challenges in social network data anonymization:

First, modeling the adversary's background knowledge about the network is much harder. For relational data tables, a small set of quasi-identifiers is used to define the attack models, whereas for network data, various information, such as the attributes of an entity and the relationships between different entities, may be utilized by the adversary.

Second, measuring the information loss in anonymizing social network data is harder than in anonymizing relational data. It is difficult to determine whether the original network and the anonymized network differ in certain properties of the network.

Third, devising anonymization methods for social network data is much harder than for relational data. Anonymizing a group of tuples in a relational table does not affect other tuples. However, when modifying a network, changing one vertex or edge may affect the rest of the network. Therefore, ‘‘divide-and-conquer’’ methods, which are widely applied to relational data, cannot be applied to network data.

To deal with the above challenges, many approaches have been proposed. According to [15], anonymization methods on simple graphs, where vertices are not associated with attributes and edges have no labels, can be classified into three categories, namely edge modification, edge randomization, and clustering-based generalization. Comprehensive surveys of approaches to social network data anonymization are available in the literature. In this paper, we briefly review some of the very recent studies, with focus on the following three aspects: attack model, privacy model, and data utility.

3.2.3 ATTACK MODEL

Given the anonymized network data, adversaries usually rely on background knowledge to de-anonymize individuals and learn the relationships between de-anonymized individuals. Zhou et al. [14] identify six types of background knowledge, i.e. attributes of vertices, vertex degrees, link relationships, neighborhoods, embedded subgraphs and graph metrics. Peng et al. [17] propose an algorithm called Seed-and-Grow to identify users from an anonymized social graph, based solely on graph structure. The algorithm first identifies a seed sub-graph which is either planted by an attacker or divulged by the collusion of a small group of users, and then grows the seed based on the adversary's existing knowledge of the users' social relations. Zhu et al. design a structural attack to de-anonymize social graph data. The attack uses the cumulative degree of the n-hop neighbors of a vertex as the regional feature, and combines it with a simulated annealing-based graph matching method to re-identify vertices in anonymous social graphs.

Sun et al. introduce a relationship attack model called the mutual friend attack, which is based on the number of mutual friends of two connected individuals. Fig. 3.1 shows an example of the mutual friend attack. The original social network G with vertex identities is shown in Fig. 3.1(a), and Fig. 3.1(b) shows the corresponding anonymized network where all individuals' names are removed. In this network, only Alice and Bob have 4 mutual friends. If an adversary knows this information, then he can uniquely re-identify that the edge (D, E) in Fig. 3.1(b) corresponds to (Alice, Bob).

FIGURE 3.1. Example of mutual friend attack: (a) original network; (b) naïve anonymized network.

Tai et al. investigate the friendship attack, where an adversary utilizes the degrees of two vertices connected by an edge to re-identify related victims in a published social network data set. Fig. 3.2 shows an example of the friendship attack. Suppose that each user's friend count (i.e. the degree of the vertex) is publicly available. If the adversary knows that Bob has 2 friends and Carl has 4 friends, and he also knows that Bob and Carl are friends, then he can uniquely identify that the edge (2, 3) in Fig. 3.2(b) corresponds to (Bob, Carl).

FIGURE 3.2. Example of friendship attack: (a) original network; (b) naïve anonymized network.

Another type of attack, namely the degree attack, has also been explored. The motivation is that each individual in a social network is associated with not only a vertex identity but also a community identity, and the community identity may reflect sensitive information about the individual. It has been shown that, based on some background knowledge about vertex degrees, even if the adversary cannot precisely identify the vertex corresponding to an individual, community information and neighborhood information can still be inferred.

For example, the network shown in Fig. 3.3 consists of two communities, and the community identity reveals sensitive information (i.e. disease status) about its members. Suppose that an adversary knows that John has 5 friends; then he can infer that John has AIDS, even though he is not sure which of the two vertices (vertex 2 and vertex 3) in the anonymized network (Fig. 3.3(b)) corresponds to John. From the above discussion we can see that graph data contain rich information that can be explored by an adversary to initiate an attack. Modeling the background knowledge of the adversary is difficult yet very important for deriving privacy models.

FIGURE 3.3. Example of degree attack: (a) original network; (b) naïve anonymized network.

FIGURE 3.4. Examples of k-NMF anonymity: (a) 3-NMF; (b) 4-NMF; (c) 6-NMF.

FIGURE 3.5. Examples of k²-degree anonymous graphs: (a) 2²-degree; (b) 3²-degree; (c) 2²-degree.

3.2.4 PRIVACY MODEL

Based on the classic k-anonymity model, a number of privacy models have been proposed for graph data. Some of these models have been summarized in existing surveys, such as k-degree, k-neighborhood, k-automorphism, k-isomorphism, and k-symmetry. In order to protect the privacy of relationships from the mutual friend attack, Sun et al. introduce a variant of k-anonymity, called k-NMF anonymity. NMF is a property defined for an edge in an undirected simple graph, representing the number of mutual friends between the two individuals linked by the edge. If a network satisfies k-NMF anonymity (see Fig. 3.4), then for each edge e there are at least k − 1 other edges with the same number of mutual friends as e. It can then be guaranteed that the probability of an edge being identified is not greater than 1/k (a small computational check of this property is sketched at the end of this subsection).

Tai et al. [13] introduce the concept of k²-degree anonymity to prevent friendship attacks. A graph Ḡ is k²-degree anonymous if, for every vertex with an incident edge of degree pair (d1, d2) in Ḡ, there exist at least k − 1 other vertices, such that each of the k − 1 vertices also has an incident edge of the same degree pair (see Fig. 3.5). Intuitively, if a graph is k²-degree anonymous, then the probability of a vertex being re-identified is not greater than 1/k, even if an adversary knows a certain degree pair (dA, dB), where A and B are friends.

To prevent degree attacks, Tai et al. [13] introduce the concept of structural diversity. A graph satisfies k-structural diversity anonymization (k-SDA) if, for every vertex v in the graph, there are at least k communities such that each of these communities contains at least one vertex with the same degree as v (see Fig. 3.6). In other words, for each vertex v, there are at least k − 1 other vertices located in at least k − 1 other communities.

FIGURE 3.6. Examples of 2-structurally diverse graphs, where the community ID is indicated beside each vertex.
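As a concrete reading of the k-NMF definition above, the following sketch checks whether a given graph satisfies k-NMF anonymity by counting, for every edge, the number of mutual friends of its endpoints. The use of networkx and the random test graph are assumptions made purely for illustration.

```python
import networkx as nx
from collections import Counter

def satisfies_k_nmf(graph: nx.Graph, k: int) -> bool:
    """Check k-NMF anonymity: every edge's number of mutual friends (common
    neighbors of its two endpoints) must be shared by at least k edges in total."""
    nmf_values = [
        len(list(nx.common_neighbors(graph, u, v)))
        for u, v in graph.edges()
    ]
    counts = Counter(nmf_values)
    return all(count >= k for count in counts.values())

# Toy usage on a small random graph (illustrative only).
g = nx.erdos_renyi_graph(n=20, p=0.3, seed=1)
print(satisfies_k_nmf(g, k=2))
```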

3.2.5 DATA UTILITY

In the context of network data anonymization, data utility refers to whether and to what extent properties of the graph are preserved. Wu et al. [15] summarize three types of properties considered in current studies. The first type is graph topological properties, which are defined for applications aiming at analyzing graph properties; various measures have been proposed to indicate the structural characteristics of the network. The second type is graph spectral properties. The spectrum of a graph is usually defined as the set of eigenvalues of the graph's adjacency matrix or other derived matrices, and it has close relations with many graph characteristics. The third type is aggregate network queries. An aggregate network query calculates an aggregate on some paths or subgraphs satisfying some query conditions; the accuracy of answering aggregate network queries can be considered as a measure of utility preservation.

Most existing k-anonymization algorithms for network data publishing perform edge insertion and/or deletion operations, and they try to reduce the utility loss by minimizing the changes to the graph degree sequence. Wang et al. [13] consider that the degree sequence only captures limited structural properties of the graph and that the derived anonymization methods may cause large utility loss. They propose utility loss measurements built on community-based graph models, including both the flat community model and the hierarchical community model, to better capture the impact of anonymization on network topology. (A toy illustration of degree-sequence and spectral utility measures is sketched at the end of this subsection.)

One important characteristic of social networks is that they keep evolving over time, and sometimes the data collector needs to publish the network data periodically. The privacy issue in the sequential publishing of dynamic social network data has recently attracted researchers' attention. Medforth and Wang [14] identify a new class of privacy attack, named the degree-trail attack, arising from publishing a sequence of graph data. They demonstrate that, even if each published graph is anonymized by strong privacy-preserving techniques, an adversary with little background knowledge can re-identify the vertex belonging to a known target individual by comparing the degrees of vertices in the published graphs with the degree evolution of the target. In [15], Tai et al. adopt the same attack model used in [34] and propose a privacy model called dynamic k^w-structural diversity anonymity (k^w-SDA) for protecting the vertex and multi-community identities in sequential releases of a dynamic network. The parameter k has a similar implication as in the original k-anonymity model, and w denotes the time period over which an adversary can monitor a target to collect attack knowledge. They develop a heuristic algorithm for generating releases satisfying this privacy requirement.
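To illustrate the utility measures discussed in this subsection, the sketch below compares an original graph with an ‘‘anonymized’’ copy using two of the indicators mentioned above: the change in the degree sequence and the change in the adjacency spectrum. The example graph, the single edge deletion standing in for an anonymization operation, and the specific distance measures are illustrative assumptions.

```python
import networkx as nx
import numpy as np

def utility_loss(original: nx.Graph, anonymized: nx.Graph) -> dict:
    """Rough utility-loss indicators: change in the sorted degree sequence and
    in the adjacency spectrum (illustrative measures, same vertex set assumed)."""
    deg_orig = np.sort([d for _, d in original.degree()])
    deg_anon = np.sort([d for _, d in anonymized.degree()])
    spec_orig = np.sort(np.linalg.eigvals(nx.to_numpy_array(original)).real)
    spec_anon = np.sort(np.linalg.eigvals(nx.to_numpy_array(anonymized)).real)
    return {
        "degree_seq_l1": float(np.abs(deg_orig - deg_anon).sum()),
        "spectral_l2": float(np.linalg.norm(spec_orig - spec_anon)),
    }

# Toy usage: "anonymize" by deleting one edge, then measure the change.
g = nx.karate_club_graph()
g_anon = g.copy()
g_anon.remove_edge(*list(g.edges())[0])   # a hypothetical edge-deletion operation
print(utility_loss(g, g_anon))
```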

CHAPTER 4

DATA MINER

4.1 CONCERNS OF DATA MINER

In order to discover the useful knowledge desired by the decision maker, the data miner applies data mining algorithms to the data obtained from the data collector. The privacy issues arising from the data mining operations are twofold. On one hand, if personal information can be directly observed in the data and a data breach happens, the privacy of the original data owner (i.e. the data provider) will be compromised. On the other hand, equipped with many powerful data mining techniques, the data miner is able to find out various kinds of information underlying the data, and sometimes the data mining results may reveal sensitive information about the data owners. For example, in the Target story we mentioned in Section 1.2, the information about the daughter's pregnancy, which was inferred by the retailer via mining customer data, is something that the daughter did not want others to know. To encourage data providers to participate in the data mining activity and provide more sensitive data, the data miner needs to make sure that the above two privacy threats are eliminated; in other words, data providers' privacy must be well preserved. Different from existing surveys on privacy-preserving data mining (PPDM), in this paper we consider it to be the data collector's responsibility to ensure that sensitive raw data are modified or trimmed out from the published data. The primary concern of the data miner is therefore how to prevent sensitive information from appearing in the mining results. To perform privacy-preserving data mining, the data miner usually needs to modify the data he has obtained from the data collector, so a decline in data utility is inevitable. Similar to the data collector, the data miner also faces the privacy-utility trade-off problem, but in the context of PPDM, quantifications of privacy and utility are closely related to the mining algorithm employed by the data miner.

4.2 APPROACHES TO PRIVACY PROTECTION

Extensive PPDM approaches have been proposed (see [5]–[7] for detailed surveys). These approaches can be classified by different criteria [13], such as the data distribution, the data modification method, the data mining algorithm, etc. Based on the distribution of data, PPDM approaches can be classified into two categories, namely approaches for centralized data mining and approaches for distributed data mining. Distributed data mining can be further categorized into data mining over horizontally partitioned data and data mining over vertically partitioned data. Based on the technique adopted for data modification, PPDM can be classified into perturbation-based, blocking-based, swapping-based, etc. Since we define the privacy-preserving goal of the data miner as preventing sensitive information from being revealed by the data mining results, in this section we classify PPDM approaches according to the type of data mining task. Specifically, we review recent studies on privacy-preserving association rule mining, privacy-preserving classification, and privacy-preserving clustering, respectively.

Since many of these studies deal with distributed data mining, where secure multi-party computation is widely applied, we briefly introduce secure multi-party computation (SMC) here. SMC is a subfield of cryptography. In general, SMC assumes a number of participants P1, P2, ..., Pm, each holding a private input X1, X2, ..., Xm. The participants want to compute the value of a public function f of m variables at the point (X1, X2, ..., Xm). An SMC protocol is called secure if, at the end of the computation, no participant knows anything except his own data and the result of the global calculation. We can view this by imagining a trusted third party (TTP): every participant gives his input to the TTP, and the TTP performs the computation and sends the results to the participants. By employing an SMC protocol, the same result can be achieved without the TTP. In the context of distributed data mining, the goal of SMC is to make sure that each participant can get the correct data mining result without revealing his data to others.
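A classic textbook example of SMC is the secure-sum protocol, in which each participant splits its private value into random additive shares so that no single party (and no trusted third party) ever sees another's input, yet the exact global sum is recovered. The sketch below is a minimal single-process simulation of that idea; it is not one of the specific protocols surveyed in the PPDM literature.

```python
import random

def secure_sum(private_values, modulus=10**9):
    """Toy secure-sum simulation: each value is split into additive shares mod
    `modulus`, so participant j only ever sees the j-th share from every party.
    The modulus must exceed the true sum for the result to be exact."""
    m = len(private_values)
    all_shares = []
    for value in private_values:
        shares = [random.randrange(modulus) for _ in range(m - 1)]
        shares.append((value - sum(shares)) % modulus)   # shares sum to the value (mod modulus)
        all_shares.append(shares)
    # Each participant j adds up the shares it received and publishes the partial sum.
    partial_sums = [sum(all_shares[i][j] for i in range(m)) % modulus
                    for j in range(m)]
    return sum(partial_sums) % modulus

# Toy usage: three parties compute the sum of their private inputs.
print(secure_sum([12, 7, 30]))   # -> 49, with no party revealing its own input
```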

CHAPTER 5

DECISION MAKER

5.1 CONCERNS OF DECISION MAKER

The ultimate goal of data mining is to provide useful information to the decision maker, so that the decision maker can choose a better way to achieve his objective, such as increasing the sales of products or making correct diagnoses of diseases. At first glance, it seems that the decision maker has no responsibility for protecting privacy, since we usually interpret privacy as sensitive information about the original data owners (i.e. data providers); generally, the data miner, the data collector and the data provider himself are considered to be responsible for its safety. However, if we look at the privacy issue from a wider perspective, we can see that the decision maker also has his own privacy concerns. The data mining results provided by the data miner are of high importance to the decision maker. If the results are disclosed to someone else, e.g. a competing company, the decision maker may suffer a loss; that is to say, from the perspective of the decision maker, the data mining results are sensitive information. On the other hand, if the decision maker does not get the data mining results directly from the data miner, but from someone else whom we call the information transmitter, the decision maker should be skeptical about the credibility of the results, in case they have been distorted. Therefore, the privacy concerns of the decision maker are twofold: how to prevent unwanted disclosure of sensitive mining results, and how to evaluate the credibility of the received mining results.

5.2 APPROACHES TO PRIVACY PROTECTION

To deal with the first privacy issue raised above, i.e. to prevent unwanted disclosure of sensitive mining results, the decision maker usually has to resort to legal measures, for example, making a contract with the data miner that forbids the miner from disclosing the mining results to any third party. To handle the second issue, i.e. to determine whether the received information can be trusted, the decision maker can utilize methodologies from data provenance, credibility analysis of web information, or other related research fields. In the rest of this section, we first briefly review the studies on data provenance and web information credibility, and then present a preliminary discussion about how these studies can help to analyze the credibility of data mining results.

5.2.1 DATA PROVENANCE

If the decision maker does not get the data mining results directly from the data miner, he
would want to know how the results are delivered to him and what kind of modification
may have been applied to the results, so that he can determine whether the results can be
trusted. This is why ‘‘provenance’’ is needed. The term provenance originally refers to
the chronology of the ownership, custody or location of a historical object. In information
science, a piece of data is treated as the historical object, and data provenance refers to
the information that helps determine the derivation history of the data, starting from the
original source [18]. Two kinds of information can be found in the provenance of the
data: the ancestral data from which current data evolved, and the transformations applied
to ancestral data that helped to produce current data. With such information, people can
better understand the data and judge the credibility of the data.

Since the 1990s, data provenance has been extensively studied in the fields of databases and workflows, and several surveys are now available. In [18], Simmhan et al. present a taxonomy of data provenance techniques. The following five aspects are used to capture the characteristics of a provenance system:

• Application of provenance. Provenance systems may be constructed to support a number of uses, such as estimating data quality and reliability, tracing the audit trail of data, repeating the derivation of data, etc.

• Subject of provenance. Provenance information can be collected about different resources present in the data processing system and at various levels of detail.

• Representation of provenance. There are mainly two types of methods for representing provenance information: annotation and inversion. The annotation method uses metadata, which comprise the derivation history of the data, as annotations and descriptions of source data and processes. The inversion method uses the property by which some derivations can be inverted to find the input data supplied to derive the output data.

• Provenance storage. Provenance can be tightly coupled to the data it describes and located in the same data storage system, or even be embedded within the data file. Alternatively, provenance can be stored separately with other metadata or simply by itself.

• Provenance dissemination. A provenance system can use different ways to disseminate the provenance information, such as providing a derivation graph that users can browse and inspect.

In [12], Glavic et al. present another categorization scheme for provenance systems. The proposed scheme consists of three main categories: provenance model, query and manipulation functionality, and storage model and recording strategy. Davidson and Freire review studies on provenance for scientific workflows. They summarize the key components of a provenance management solution, discuss applications of workflow provenance, and outline a few open problems for database-related research.

As the Internet has become a major platform for information sharing, the provenance of Internet information has attracted some attention. Researchers have developed approaches for information provenance in the semantic web and in social media. Hartig proposes a provenance model that captures both information about web-based data access and information about the creation of data; in this model, an ontology-based vocabulary is developed to describe the provenance information. Moreau [95] reviews research issues related to tracking provenance in the semantic web from the following four aspects: publishing provenance on the web; using semantic web technologies to facilitate provenance acquisition, representation, and reasoning; tracking the provenance of RDF (resource description framework)-based information; and tracking the provenance of inferred knowledge. Barbier and Liu study the information provenance problem in social media. They model the social network as a directed graph G(V, E, p), where V is the node set and E is the edge set. Each node in the graph represents an entity and each directed edge represents the direction of information propagation; an information propagation probability p is attached to each edge. Based on this model, they define the information provenance problem as follows: given a directed graph G(V, E, p) with known terminals T ⊆ V and a positive integer constant k ∈ Z⁺, identify the sources S ⊆ V such that |S| ≤ k and U(S, T) is maximized. The function U(S, T) estimates the utility of information propagation that starts from the sources S and stops at the terminals T. To solve this provenance problem, one can leverage the unique features of social networks, e.g. user profiles, user interactions, and spatial or temporal information. Two approaches have been developed to seek the provenance of information: one utilizes the network information to directly seek the provenance of information, and the other aims at finding the reverse flows of information propagation.

The special characteristics of the Internet, such as openness, freedom and anonymity, pose great challenges for seeking the provenance of information. Compared to the approaches developed in the context of databases and workflows, current solutions proposed for supporting provenance in the Internet environment are less mature. There are still many problems to be explored in future studies.

5.2.2 WEB INFORMATION CREDIBILITY

Because of the lack of publishing barriers, the low cost of dissemination, and the lax control of quality, the credibility of web information has become a serious issue. Tudjman et al. [17] identify the following five criteria that can be employed by Internet users to differentiate false information from the truth:

• Authority: the real author of false information is usually unclear.

• Accuracy: false information does not contain accurate data or approved facts.

• Objectivity: false information is often prejudicial.

• Currency: for false information, the data about its source, time and place of its
origin is incomplete, out of date, or missing.

• Coverage: false information usually contains no effective links to other information online.

In [98], Metzger summarizes the skills that can help users to assess the credibility of
online information.

With the rapid growth of online social media, false information breeds more easily and spreads more widely than before, which further increases the difficulty of judging information credibility. Identifying rumors and their sources in microblogging networks has recently become a hot research topic. Current research usually treats rumor identification as a classification problem, which involves the following two issues:

• Preparation of the training data set. Current studies usually take rumors that have been confirmed by authorities as positive training samples. Considering the huge number of messages in microblogging networks, such training samples are far from enough to train a good classifier. Building a large benchmark data set of rumors is urgently needed.

• Feature selection. Various kinds of features can be used to characterize microblogging messages. In the current literature, the following three types of features are often used: content-based features, such as word unigrams/bigrams, part-of-speech unigrams/bigrams, text length, number of sentiment words (positive/negative), number of URLs, and number of hashtags; user-related features, such as registration time, registration location, number of friends, number of followers, and number of messages posted by the user; and network features, such as number of comments and number of retweets. (A minimal feature-extraction sketch is given after this list.)
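The sketch below illustrates the classification view of rumor identification using a handful of the content-based features listed above. The example messages, labels, feature subset and choice of logistic regression are all hypothetical; a real system would rely on a large benchmark data set and a much richer feature set.

```python
import re
import numpy as np
from sklearn.linear_model import LogisticRegression

def content_features(message: str) -> list:
    """A small, illustrative subset of the content-based features listed above."""
    return [
        len(message),                               # text length
        len(re.findall(r"https?://\S+", message)),  # number of URLs
        message.count("#"),                          # number of hashtags
        message.count("!"),                          # crude emphasis cue standing in for sentiment
    ]

# Hypothetical labeled data: 1 = confirmed rumor, 0 = normal message.
messages = [
    "Shocking!!! Celebrity X died, share now https://t.co/xyz #RIP",
    "Team meeting moved to 3pm, see the agenda in email.",
    "BREAKING #hoax miracle cure found!!! https://t.co/abc",
    "New paper on privacy-preserving data mining posted.",
]
labels = [1, 0, 1, 0]

X = np.array([content_features(m) for m in messages])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict(X))   # labels predicted for the training messages themselves
```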

So far, it is still quite difficult to automatically identify false information on the Internet. It is necessary to incorporate methodologies from multiple disciplines, such as natural language processing, data mining, machine learning, social network analysis, and information provenance, into the identification procedure.

CONCLUSION

We have reviewed the privacy issues related to data mining by using a user-role-based methodology. We differentiate four user roles that are commonly involved in data mining applications, i.e. data provider, data collector, data miner and decision maker. Each user role has its own privacy concerns; hence the privacy-preserving approaches adopted by one user role are generally different from those adopted by others:

• For the data provider, his privacy-preserving objective is to effectively control the amount of sensitive data revealed to others. To achieve this goal, he can utilize security tools to limit others' access to his data, sell his data at auction to get sufficient compensation for the privacy loss, or falsify his data to hide his true identity.

• For the data collector, his privacy-preserving objective is to release useful data to data miners without disclosing data providers' identities and sensitive information about them. To achieve this goal, he needs to develop proper privacy models to quantify the possible loss of privacy under different attacks, and to apply anonymization techniques to the data.

• For the data miner, his privacy-preserving objective is to get correct data mining results while keeping sensitive information undisclosed, either in the process of data mining or in the mining results. To achieve this goal, he can choose a proper method to modify the data before certain mining algorithms are applied, or utilize secure computation protocols to ensure the safety of the private data and of the sensitive information contained in the learned model.

• For the decision maker, his privacy-preserving objective is to make a correct judgement about the credibility of the data mining results he has received. To achieve this goal, he can utilize provenance techniques to trace back the history of the received information, or build a classifier to discriminate true information from false information.

To achieve the privacy-preserving goals of the different user roles, various methods from different research fields are required. We have reviewed recent progress in related studies, and discussed problems awaiting further investigation. We hope that the review presented in this paper can offer researchers different insights into the issue of privacy-preserving data mining, and promote the exploration of new solutions to the security of sensitive information.

REFERENCES

[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. San Mateo, CA, USA: Morgan Kaufmann, 2006.

[2] L. Brankovic and V. Estivill-Castro, ‘‘Privacy issues in knowledge discovery and data mining,’’ in Proc. Austral. Inst. Comput. Ethics Conf., 1999, pp. 89–99.

[3] R. Agrawal and R. Srikant, ‘‘Privacy-preserving data mining,’’ ACM SIGMOD Rec., vol. 29, no. 2, pp. 439–450, 2000.

[4] Y. Lindell and B. Pinkas, ‘‘Privacy preserving data mining,’’ in Advances in Cryptology. Berlin, Germany: Springer-Verlag, 2000, pp. 36–54.

[5] C. C. Aggarwal and S. Y. Philip, A General Survey of Privacy-Preserving Data Mining Models and Algorithms. New York, NY, USA: Springer-Verlag, 2008.

[6] M. B. Malik, M. A. Ghazi, and R. Ali, ‘‘Privacy preserving data mining techniques: Current scenario and future prospects,’’ in Proc. 3rd Int. Conf. Comput. Commun. Technol. (ICCCT), Nov. 2012, pp. 26–32.

[7] S. Matwin, ‘‘Privacy-preserving data mining techniques: Survey and challenges,’’ in Discrimination and Privacy in the Information Society. Berlin, Germany: Springer-Verlag, 2013, pp. 209–221.

[8] E. Rasmusen, Games and Information: An Introduction to Game Theory, vol. 2. Cambridge, MA, USA: Blackwell, 1994.

[9] V. Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati, ‘‘Microdata protection,’’ in Secure Data Management in Decentralized Systems. New York, NY, USA: Springer-Verlag, 2007, pp. 291–321.

[10] O. Tene and J. Polenetsky, ‘‘To track or ‘do not track’: Advancing transparency and individual control in online behavioral advertising,’’ Minnesota J. Law, Sci. Technol., no. 1, pp. 281–357, 2012.

[11] R. T. Fielding and D. Singer, Tracking Preference Expression (DNT), W3C Working Draft, 2014. [Online]. Available: http://www.w3.org/TR/2014/WD-tracking-dnt-20140128/

[12] R. Gibbons, A Primer in Game Theory. Hertfordshire, U.K.: Harvester Wheatsheaf, 1992.

[13] D. C. Parkes, ‘‘Iterative combinatorial auctions: Achieving economic and computational efficiency,’’ Ph.D. dissertation, Univ. Pennsylvania, Philadelphia, PA, USA, 2001.

[14] S. Carter, ‘‘Techniques to pollute electronic profiling,’’ U.S. Patent 11/257 614, Apr. 26, 2007. [Online]. Available: https://www.google.com/patents/US20070094738

[15] Verizon Communications Inc., 2013 Data Breach Investigations Report, 2013. [Online]. Available: http://www.verizonenterprise.com/resources/reports/rp_data-breach-investigations-report-2013_en_xg.pdf

[16] A. Narayanan and V. Shmatikov, ‘‘Robust de-anonymization of large sparse datasets,’’ in Proc. IEEE Symp. Secur. Privacy (SP), May 2008, pp. 111–125.

[17] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu, ‘‘Privacy-preserving data publishing: A survey of recent developments,’’ ACM Comput. Surv., vol. 42, no. 4, Jun. 2010, Art. ID 14.

[18] R. C.-W. Wong and A. W.-C. Fu, ‘‘Privacy-preserving data publishing: An overview,’’ Synthesis Lectures Data Manage., vol. 2, no. 1, pp. 1–138, 2010.
