Topic 1:

Introduction to Data Mining

Instructor: Chris Volinsky

Data Mining - Columbia University

Intro
Who am I?

Who are you?


Class Schedule
Sept 8 to December 8
No class Election Day or Thanksgiving

Syllabus:
www.research.att.com/~volinsky/DataMining/Columbia2011/Columbia2011.html
My email: [email protected]
My phone: 973-360-8644
My office hours: by appointment before or after class

Class Assessment
30% HW
Due every two weeks
1st HW due next Thursday September 15
No late HW accepted

40% Tests
Midterm and Final

30% Data Mining Project

Proposal due in October
Project due Tuesday Dec 13

Course Objectives
Direct Objectives:
To learn data mining techniques
To see their use in real-world/research applications
To understand limitations of standard statistical
techniques in data mining applications
To get an understanding of the methodological
principles behind data mining
To be able to read about data mining in the popular
press with a critical eye
To implement & use data mining models using
statistical software


Data Analysis Project

The goal of data mining is to find interesting patterns
in data. You will be required to:

Define a scientific question of interest

Collect a data set n>1000 (probably online)
Prepare the data set properly
Analyze the data using appropriate models
Write a 10-20 page report on your analysis (graphics
included)

Project proposals (1/2 to 1 page) will be due in early October.
Volunteers to present projects in class for extra credit.
Finished reports will be due December 13.

Data Mining Software

Software
Can use any software you like, but you must know how to input, manipulate, graph, and analyze data.
Preferred: R
Also: SAS, Weka, SPSS, Systat, Enterprise Miner, JMP, Minitab,
Matlab, SQL Server
Maybe not: Excel, C

What is R?

Open source statistical software grown out of S/Splus

www.r-project.org
Many user-contributed packages at CRAN (cran.r-project.org)
Active, helpful user community (help lists, bulletin boards, etc)
R Tutorials available online (see class website and CRAN)
Great graphics (with a bit of a learning curve)

Other useful tools: Perl/Python, AWK, Shell scripts


Resources
Data mining is a new field and, as such, does not have authoritative texts (yet).
This class draws from many sources; the best are:
Elements of Statistical Learning - Hastie, Tibshirani, and Friedman
Principles of Data Mining - Hand, Mannila, and Smyth
Interactive and Dynamic Graphics for Data Analysis - Cook and Swayne
Data Mining: Practical Machine Learning Tools and Techniques - Witten and Frank
Also good class notes available from other classes:

David Madigan, Columbia

Di Cook, Iowa State
Padhraic Smyth, UC Irvine
Jiawei Han, Simon Fraser

(see class web site for pointers to these notes, or just Google them!)

Also a few good books which teach stats/DM through R:

The R Book - Crawley
A Handbook of Statistical Analyses Using R - Everitt and Hothorn
Modern Applied Statistics with S-PLUS - Venables and Ripley

Course Outline
Each unit covers two lectures
Units:

Intro to Data Mining

Data exploration and visualization
Data Mining Concepts
Regression Topics
Classification and Supervised Learning
Clustering and Unsupervised Learning
Text Mining and Information Retrieval
Web Mining
Social Networks
Assorted Topics
Advanced Classification: neural networks, support vector machines
Ensemble methods
Recommender Systems
Fraud


What is Data Mining?

Not well defined.
No one can agree on what data mining is! In fact, the experts have very different descriptions:
"finding interesting structure (patterns, statistical models, relationships) in data bases." - Fayyad, Chaudhuri, et al.
"the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." - Fayyad
"a knowledge discovery process of extracting previously unknown, actionable information from very large data bases" - Zornes
"a process that uses a variety of data analysis tools to discover patterns and relationships in data that may be used to make valid predictions." - Edelstein

What is Data Mining

From Zaiane:
Data Mining, also popularly known as Knowledge Discovery in
Databases (KDD)...
The Knowledge Discovery in Databases process comprises of a
few steps leading from raw data collections to some form of new
knowledge. The iterative process consists of the following steps:

Data cleaning: ...

Data integration: ...
Data selection: ...
Data transformation: ...
Data mining: it is the crucial step in which clever techniques are applied
to extract patterns potentially useful.
Pattern evaluation: ...
Knowledge representation: ...


What is Data Mining?

What does the authority say?
Data mining is the process of extracting hidden patterns from
data.
Data mining is the process of discovering new patterns from
large data sets involving methods from statistics and artificial
intelligence but also database management.

Hand, Mannila, Smyth:

data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are
both understandable and useful to the data owner

Isn't that the same as statistics?

Data Mining vs. Statistics

Snark: Data Mining = Statistics + Marketing
Statistics is known for:

well-defined hypotheses used to learn about a specifically chosen population, studied using carefully collected data, providing inferences with well-known properties.

Data mining isn't that careful. It is:

data-driven discovery of models and patterns from massive, observational data sets

Data Mining v. Statistics

Traditional statistics

first hypothesize, then collect data, then analyze

often model-oriented (strong parametric models)
Focused on understanding

Data mining (also Machine Learning):

few if any a priori hypotheses

data is usually already collected a priori
analysis is typically data-driven not hypothesis-driven
Often algorithm-oriented rather than model-oriented
Focused on prediction

But

statistical ideas are very useful in data mining, e.g., in validating whether
discovered knowledge is useful
Increasing overlap at the boundary of statistics and DM
Cultures could learn from each other
Very powerful when used together


Data Mining Enablers

Explosion of data
Fast and cheap computation and storage
Moore's Law: processing power doubles roughly every two years
Disk storage doubles every 9 months
Database technology

Competitive pressure in business

Data has value! Successes are widely publicized

Commercial products
SAS, SPSS, Google Analytics, IBM, Oracle

Open Source products

Weka
R

Don't need a data mining expert to do data mining!

Data-Driven Discovery
Observational data
cheap relative to experimental data
Examples:
Retail stores, airlines, etc
Amazon, Google, etc
Do iPhone users use more data than Android users?

Makes sense to leverage available, observational data

What are the perils of observational data?

Easy to do pseudo-experiments
Observational data can also help in hypothesis formulation.

Data Mining: Confluence of Multiple Disciplines

[Diagram: Data Mining at the center, surrounded by Database Technology, Machine Learning, Statistics, Information Science, Visualization, and Other Disciplines]

Different fields have different views of what data mining is (also different terminology!)

Data Data Data

It's all about the data - where does it come from?

www
NASA
Business processes/transactions
Telecommunications and networking
Medical imagery
Government, census, demographics
(data.gov!)
Sensor networks, RFID tags
sports

Types of Data: Flat File or Vector Data

[Figure: data matrix with n rows and p columns]
Rows = objects
Columns = measurements on objects

Represent each row as a p-dimensional vector, where p is the dimensionality
In effect, embed our objects in a p-dimensional vector space

Both n and p can be very large in data mining (also p >> n)

Matrix can be quite sparse

Types of Data: Text Data

Can be represented as a sparse matrix

[Figure: sparse matrix of text documents by word IDs; labels shown: Obama, The Help]
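
A minimal sketch of the sparse document-by-word representation described above. The three toy documents and the whitespace tokenizer are illustrative assumptions, not part of the original slides.

```python
from collections import Counter

# Toy corpus (hypothetical documents, for illustration only)
docs = [
    "obama speaks to congress",
    "the help tops the box office",
    "obama reviews the help",
]

# Assign each distinct word an integer ID in order of first appearance
vocab = {}
for doc in docs:
    for word in doc.split():
        vocab.setdefault(word, len(vocab))

# Sparse matrix stored as {(row, word_id): count}; absent keys are implicit zeros
sparse = {}
for row, doc in enumerate(docs):
    for word, count in Counter(doc.split()).items():
        sparse[(row, vocab[word])] = count

n, p = len(docs), len(vocab)
density = len(sparse) / (n * p)  # fraction of cells that are non-zero
print(f"{n} documents x {p} words, {len(sparse)} non-zero cells, density {density:.2f}")
```

Only the non-zero counts are stored, which is what makes this layout practical when p (the vocabulary) runs into the hundreds of thousands.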

Transactional Data
Date-stamped events (web logs, phone calls):
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

Can be represented as a time series:

[Figure: one sequence of event codes per user, User 1 through User 5]
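
A rough sketch of how log lines like those above could be turned into one event sequence per user. The field positions (client IP first, requested page in the 14th field) are inferred from the sample lines, and the page-to-code mapping is a hypothetical choice made for illustration.

```python
from collections import defaultdict

# Three lines copied from the log excerpt above
log_lines = [
    "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
    "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
    "128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,",
]

# Hypothetical mapping from page to an integer event code
page_codes = {"/top.html": 1, "/spt/main.html": 2, "/spt/images/bk1.jpg": 3}

# Group events by client IP, preserving the order in which they were logged
sequences = defaultdict(list)
for line in log_lines:
    fields = [f.strip() for f in line.split(",")]
    client_ip, page = fields[0], fields[13]
    sequences[client_ip].append(page_codes[page])

for ip, seq in sequences.items():
    print(ip, seq)
```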


Types of Data: Relational Data
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
,

128.195.36.195, Doe, John, 12 Main St, 973-462-3421, Madison, NJ, 07932

114.12.12.25,Trank, Jill, 11 Elm St, 998-555-5675, Chester, NJ, 07911

07911, Chester, NJ, 07954, 34000, , 40.65, -74.12

07932, Madison, NJ, 56000, 40.642, -74.132

Most large data sets are stored in relational databases
Special data query language: SQL
Oracle, MSFT, IBM
Good open source versions: MySQL, PostgreSQL
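
A small sketch of the relational idea, using Python's built-in sqlite3 module in place of the database products named above. The table and column names (hits, customers, ip, zip) are assumptions made for illustration; the rows are taken from the sample records shown on this slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# Two related tables: web hits and customer records, linked by IP address
cur.execute("CREATE TABLE hits (ip TEXT, page TEXT, status INTEGER)")
cur.execute("CREATE TABLE customers (ip TEXT, last TEXT, first TEXT, zip TEXT)")

cur.executemany("INSERT INTO hits VALUES (?, ?, ?)", [
    ("128.195.36.195", "/top.html", 200),
    ("128.195.36.195", "/spt/main.html", 200),
    ("128.195.36.195", "/spt/images/bk1.jpg", 404),
])
cur.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    ("128.195.36.195", "Doe", "John", "07932"),
    ("114.12.12.25", "Trank", "Jill", "07911"),
])

# Join the tables to count hits per known customer
rows = cur.execute("""
    SELECT c.last, c.zip, COUNT(*) AS n_hits
    FROM hits h JOIN customers c ON h.ip = c.ip
    GROUP BY c.last, c.zip
""").fetchall()
conn.close()

print(rows)
```

The join is what lets an analysis combine behavior (hits) with attributes stored elsewhere (customer and, by extension, zip-code demographics).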

Types of Data: Time Series Data

Often many time series, long time series, or multivariate time series

Time Series: Ebay Data

Jank, Shmueli, et al. (2005)

Types of Data: Image Data


Spatio-Temporal Data

http://senseable.mit.edu/nyte/movies/nyte-globe-encounters.mov

Network Data: Physical Network


Network Data: Derived Social Network

"Algorithms for estimating relative importance in networks"

S. White and P. Smyth, ACM SIGKDD, 2003.

Social Network: Real Social Network

HP Labs email network
500 people, 20k relationships

Examples of Data Mining

Successes
Market Basket (WalMart)
Recommender Systems (Amazon.com)
Fraud Detection in Telecommunications
(AT&T)
Target Marketing / CRM
Financial Markets
DNA Microarray analysis (or is it?)
Web Traffic / Blog analysis


Examples of Data Mining

Successes

Google is a company built on data mining

PageRank mined the web to build better
search
Google as spell checker
Google as ad placer
Google as news aggregator
Google as face recognizer

The Data Mining Process

Often called KDD - Knowledge
Discovery in Databases
Analysis is just one part of the process

Data collection and storage

Data cleaning
Data sampling
Analysis
Decision making


Different Data Mining Tasks

Exploratory Data Analysis
Descriptive Modeling
Predictive Modeling
Discovering Patterns and Rules
+ others.

Exploratory Data Analysis

Before you model what do you do?
Must check your data

Compute summary statistics: range, max, min, mean, median, variance, skewness, ...

Missing values, outliers, skewness, etc.

What types of variables do you have?

Visualization is widely used

1d histograms
2d scatter plots
Higher-dimensional methods

Simple exploratory analysis can be extremely valuable

Always look at your data before applying any data mining algorithms
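
A minimal sketch of the checks listed on this slide, using only Python's standard library. The data values are made up, and the outlier rule (more than 3 median absolute deviations from the median) is one common convention chosen here for illustration, not a rule stated in the course.

```python
import statistics

# Hypothetical measurements; None marks a missing value
raw = [12.1, 11.8, None, 12.4, 95.0, 12.0, 11.9, None, 12.3]

# Check for missing values before computing anything
values = [v for v in raw if v is not None]
n_missing = len(raw) - len(values)

summary = {
    "min": min(values),
    "max": max(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
    "variance": statistics.variance(values),  # sample variance
}

# Flag points more than 3 median absolute deviations (MAD) from the median
med = summary["median"]
mad = statistics.median(abs(v - med) for v in values)
outliers = [v for v in values if abs(v - med) > 3 * mad]

print(n_missing, summary, outliers)
```

Note how the single extreme value pulls the mean far from the median, which is exactly the kind of thing a quick look at the data reveals before any modeling.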

Example of Exploratory Data Analysis

Languages of the World Wide Web, Google Research Blog, July 2011

Descriptive Modeling
Goal is to build a descriptive model
e.g., a model that could simulate the data if
needed
models the underlying process

Examples:
Density estimation:
estimate the joint distribution P(x1, ..., xp)

Cluster analysis:
Find natural groups in the data

Dependency models among the p variables

Learning a Bayesian network for the data

Example of Descriptive Modeling

Hemoglobin vs. cell volume

Control Group

Anemia Group


Example of Descriptive Modeling

Control Group

Anemia Group


Predictive Modeling
Predict one variable Y given a set of other variables X
Here X could be a p-dimensional vector
Classification: Y is categorical
Regression: Y is real-valued

In effect this is function approximation: learning the relationship between Y and X

In data mining, the emphasis is on predictive accuracy, not on understanding the model
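
To make "predictive accuracy" concrete, here is a small sketch of a classifier (Y categorical) evaluated on held-out data. The 1-nearest-neighbor rule and the toy points are illustrative assumptions; the slides do not prescribe a particular method.

```python
import math

# Hypothetical labeled data: (x1, x2) -> class label
train = [((1.0, 1.2), "a"), ((0.8, 1.0), "a"), ((1.1, 0.9), "a"),
         ((3.0, 3.2), "b"), ((3.1, 2.9), "b"), ((2.8, 3.0), "b")]
# Held-out points that were not used to build the model
test = [((0.9, 1.1), "a"), ((3.2, 3.1), "b"), ((1.2, 1.0), "a")]

def predict(x):
    """Return the label of the training point nearest to x (1-nearest-neighbor)."""
    nearest = min(train, key=lambda pair: math.dist(pair[0], x))
    return nearest[1]

# Predictive accuracy: fraction of held-out points classified correctly
correct = sum(predict(x) == y for x, y in test)
accuracy = correct / len(test)
print(f"hold-out accuracy: {accuracy:.2f}")
```

The model here is judged only by how well it predicts unseen cases, not by whether its internals are interpretable.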


Predictive Modeling: Fraud Detection
Telecommunications fraud detection

Fraud costs companies billions of US dollars per year

Very few transactions are fraudulent, but they are costly

Approach

For each transaction, estimate "fraudiness".

Based on known fraud AND known user behavior
High-probability cases are investigated by fraud police

Example models:

Credit card usage profiling

anomaly detection
guilt by association


Pattern Discovery
Goal is to discover interesting local patterns in
the data rather than to characterize the data
globally
given market basket data we might discover that
If customers buy wine and bread then they buy cheese with
probability 0.9
These are known as association rules
This was how data mining was born.
But I don't like it

Other examples:
Astronomy
Finance
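
The wine, bread, and cheese rule above can be checked by simple counting. This sketch computes the support and confidence of one rule; the baskets are made up, so the resulting numbers are illustrative and do not reproduce the 0.9 quoted on the slide.

```python
# Hypothetical market baskets (one set of items per transaction)
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "olives"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "bread", "cheese"},
    {"cheese", "milk"},
]

antecedent = {"wine", "bread"}
consequent = {"cheese"}

# Count baskets containing the antecedent, and those containing both sides
n_antecedent = sum(antecedent <= b for b in baskets)
n_both = sum((antecedent | consequent) <= b for b in baskets)

support = n_both / len(baskets)      # how often the full pattern occurs
confidence = n_both / n_antecedent   # estimated P(cheese | wine and bread)
print(f"support={support:.2f}, confidence={confidence:.2f}")
```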


Example of Pattern Discovery

IBM Advanced Scout System
Bhandari et al. (1997)
Every NBA basketball game is annotated,
e.g., time = 6 mins, 32 seconds
event = 3 point basket
player = Michael Jordan
This creates a huge untapped database of information

IBM algorithms search for rules of the form

If player A is in the game, player B's scoring rate increases from 3.2 points per quarter to 8.7 points per quarter

Data Mining Pitfalls

Is data mining always necessary?
Just because you have a terabyte doesn't mean you need to use it.

Privacy concerns
Differ by country, industry, application,
generation

Meaningfulness of patterns unclear

Rhine paradox
Terrorism
DM has a lot to learn from statistics!

Rhine Paradox
J. B. Rhine: parapsychologist who studied ESP (he was a believer!)
He devised an experiment where subjects were
asked to guess 10 hidden cards --- red or blue.
Reported: 1 in 1000 people have ESP
He told these people they had ESP and called
them in for another test of the same type.
What do you think happened?
What is the conclusion?
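
A quick calculation helps with the two questions above. It assumes each of the 10 guesses is an independent fair coin flip between red and blue; the number of subjects tested (1000) is a hypothetical figure chosen to match the "1 in 1000" report.

```python
# Chance of guessing all 10 two-colour cards correctly by luck alone
p_all_correct = 0.5 ** 10  # 1/1024, very close to "1 in 1000"

n_subjects = 1000  # hypothetical number of people tested
expected_lucky = n_subjects * p_all_correct

# Probability that at least one subject gets a perfect score purely by chance
p_at_least_one = 1 - (1 - p_all_correct) ** n_subjects

# Chance that a "lucky" subject repeats the perfect score on the retest
p_repeat = p_all_correct

print(f"P(perfect score by luck)       = {p_all_correct:.6f}")
print(f"Expected lucky subjects        = {expected_lucky:.2f}")
print(f"P(at least one lucky subject)  = {p_at_least_one:.3f}")
print(f"P(lucky subject repeats it)    = {p_repeat:.6f}")
```

Under pure chance, roughly one perfect score per thousand subjects is expected, and that subject will almost certainly fail the retest.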


Data Mining Pitfalls

PR Problems: data mining as a four-letter word?
"...increasingly people's data is at risk. The old ways ...are still at use like dumpster diving, stealing from mailboxes, physical theft, and credit card receipt copying. New tactics include disparate techniques of phishing, email fraud, data mining, spam, keylogging and an array of other technological processes." - Steven D. Domenikos, IdentityTruth, 2008

"One place oversight is sorely lacking is in the whole matter of data mining. ...What have they contributed? Not a single case comes to mind in which security services apprehended a terrorist following identification by data mining. ...that huge database will be out there, win or lose, for some government agency to divert to its purposes or some hacker to turn to private gain or crime." - John Prados, TomPaine.com

Fighting Terrorism in the US

US Government is widely known to be collecting lots of
data on Americans and using data mining to look for
patterns consistent with terrorist activity.
Bruce Schneier, Wired Magazine, "Why Data Mining Won't Stop Terror":
Assume:
1 in 100 false positive rate (99% accurate)
1 in 1000 false negative rate
1 trillion events (phone calls, credit card transactions, emails) per year
10 are really terrorist plots

Then:
1 billion false alarms for every true plot uncovered
27 million leads daily
Even at 99.9999% accuracy: 2,750 false alarms per day
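
The arithmetic behind these figures can be reproduced in a few lines. This sketch treats the 1 trillion events as an annual total, which is the assumption under which the slide's numbers work out; the variable names are illustrative.

```python
events_per_year = 1_000_000_000_000  # 1 trillion events, treated as an annual total (assumption)
true_plots = 10
false_positive_rate = 1 / 100        # 1 in 100 innocent events is flagged

# Nearly every event is innocent, so false alarms are about rate x events
false_alarms_per_year = events_per_year * false_positive_rate  # about 10 billion
false_alarms_per_plot = false_alarms_per_year / true_plots     # about 1 billion
leads_per_day = false_alarms_per_year / 365                    # about 27 million

# Same calculation with a 1-in-a-million false positive rate (99.9999%)
leads_per_day_better = events_per_year * 1e-6 / 365            # a few thousand per day

print(f"{false_alarms_per_plot:,.0f} false alarms per true plot")
print(f"{leads_per_day:,.0f} leads per day")
print(f"{leads_per_day_better:,.0f} leads per day at 99.9999%")
```

The point is the base rate: when true cases are this rare, even an extremely accurate detector produces far more false alarms than real hits.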

Data Mining v. Privacy

There is often tension between data
mining and personal privacy:
http://www.aclu.org/pizza/images/screen.swf

Now, some case studies.


Risk v. Reward in Data Mining

More data about more people in fewer places


The risks of research

My own personal story:
or how a paper published in JCGS leads me to be connected to FBI wiretapping.

2001-2005: Publish papers on "Communities of Interest" using social networks and "Guilt by association" to catch fraud
9 September 2007: NYT lead story "F.B.I. Data Mining Reached Beyond Initial Targets" discusses FBI techniques COI and GBA
23 October 2007: Blogosphere erupts: "How AT&T Provides the FBI with Terror Suspect Leads"

The Good, The Bad, and the Maybe

The question remains: how do we
effectively leverage sensitive personal
data for research purposes?
Three case studies can give insight
Netflix Prize
AOL search dataset
Barabasi mobile study


Case Study 1: AOL Search Data

August 4, 2006: AOL releases 20M search
terms by anonymized users for research
purposes.
Why?

Within hours, uproar on the blogs

"The utter stupidity of this is staggering" - TechCrunch

August 7: AOL removes data, issues apology

"this was a screw-up, and we are angry"
"an innocent enough attempt to reach out to the research community"

August 9: NYT front page story

Identifies Thelma Arnold, 62-year-old widow

Case Study 1: AOL Search Data

What's the big deal?
Ego searches make it easy to figure out who you are; combined with porn or illegal queries, they can make for serious privacy violations.

What went wrong

Not well thought out : risk >> reward

Poor internal controls on public data release
Lack of understanding of subject matter
Lack of understanding of anonymizing data

Fallout
CTO + at least two others fired
Data still out in the public
Is it ethical to study?

Inspiration for bad drama

Example queries from the released data: "purple lilac," "happy bunny pictures," "square dancing steps," "cut into your trachea," "pee fetish," "Simpsons incest."

Case Study 2: Netflix Prize

October 2006: Netflix releases anonymized
movie ratings from its customer base
100M ratings, 500K customers (<10% of all
data)
Random integer as user ID
"some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates"

2007: Shock paper claiming de-anonymization of Netflix Prize data

Case Study 2: Netflix Prize

Narayanan and Shmatikov (2008)
The adversary with a small amount of background knowledge about an individual can identify with high probability that individual's record in the data and learn sensitive attributes
Claim that Netflix data sanitization is not relevant
Accuse Netflix of violating the Video Privacy Protection Act of 1988
Details:
With aux info on 8 movies, where 2 can be wrong, and dates known within 14 days: 99% de-anonymization

Aux info can be gotten via web sites, water coolers, etc
People might be willing to give away some ratings, but not
others
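
The matching idea can be illustrated with a deliberately simplified sketch: find the records that agree with the auxiliary information on rating and approximate date, while tolerating a few wrong items. This is not the scoring algorithm from the Narayanan and Shmatikov paper; the records, the aux data, and the matches() helper are all hypothetical.

```python
from datetime import date

# Hypothetical anonymized records: user ID -> {movie: (rating, rating date)}
records = {
    101: {"A": (5, date(2005, 3, 1)), "B": (2, date(2005, 3, 9)), "C": (4, date(2005, 4, 2))},
    102: {"A": (5, date(2005, 6, 1)), "B": (1, date(2005, 7, 1)), "D": (3, date(2005, 7, 4))},
}

# Hypothetical auxiliary knowledge about one target person (one item is wrong)
aux = {"A": (5, date(2005, 3, 4)), "B": (2, date(2005, 3, 2)), "C": (1, date(2005, 4, 3))}

def matches(record, aux, max_days=14, max_wrong=1):
    """True if at most max_wrong aux items disagree with the record.

    An item agrees when the movie is present, the rating is equal,
    and the dates are within max_days of each other.
    """
    wrong = 0
    for movie, (rating, when) in aux.items():
        hit = record.get(movie)
        if hit is None or hit[0] != rating or abs((hit[1] - when).days) > max_days:
            wrong += 1
    return wrong <= max_wrong

candidates = [uid for uid, rec in records.items() if matches(rec, aux)]
print(candidates)
```

When only one record survives the filter, the "anonymous" user ID has effectively been linked to the target person.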


Case Study 2: Netflix Prize

Much ado about nothing
Although paper is technically correct, dates are key
Without dates, you must know 8 movies, all outside of the top
500 to get over 80% chance of de-anonymization
Auxiliary data very hard to come by
No known cases discovered

Netflix did it right

Consulted with top machine learning experts
0 < risk << reward
Investment in quality data and expertise
mitigated risk

Case Study 3: Barabasi Mobile Study

Gonzalez, Hidalgo and Barabasi (2008)
Article in Nature outlines study on human mobility patterns

100,000 individuals selected randomly from dataset of 6 million

Unidentified country (unclear if the researchers knew)
Cell tower location at start of call
206 individuals were pinged every two hours for a week

Findings
humans follow simple, reproducible patterns
Sample finding: Nearly three-quarters of those studied mainly
stayed within a 20-mile-wide circle for half a year.
Results could impact all phenomena driven by human
mobility, from epidemic prevention to emergency response
and urban planning.


Case Study 3: Barabasi Mobile Study
Uproar ensued over secret tracking of cell phone users
Blowback of negative feedback to Nature and scientists
Study would be illegal in the US
Approval from ONR review board and Northeastern review
board. Barabasi did not check with an ethics panel

Response
Hidalgo: "the data could be misused, but we were not trying to do evil things. We are trying to make the world a little better."
Northeastern and Nature backed the research
Continues to be referenced as an example of dangerous
research
Risk and reward both very high


Research Concepts - Privacy

How do we guarantee that data is private?
Quasi-identifiers: combinations of attributes within the data that can be used to identify individuals.
E.g. 87% of the population of the United States can be uniquely identified by gender, date of birth, and 5-digit zip code
Datasets are k-anonymous when, for any given quasi-identifier, a record is indistinguishable from k-1 others.

But, one step further, maybe all k have a given sensitive attribute!
The diversity of sensitive values within a group is referred to as l-diversity.

Ways to fuzz data to increase anonymity and diversity:

Generalize / summarize the data : bin size, aggregate counts
Suppress or delete data
Perturb data

Balance between privacy and utility is a hot research topic
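
A small sketch of how k-anonymity and l-diversity, as defined above, could be measured on a table. The records, the choice of quasi-identifier (gender, birth year, 3-digit zip prefix), and the simple "count distinct values" reading of l-diversity are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical records: (gender, birth year, zip prefix, sensitive diagnosis)
rows = [
    ("F", 1975, "079", "flu"),
    ("F", 1975, "079", "asthma"),
    ("F", 1975, "079", "flu"),
    ("M", 1980, "079", "diabetes"),
    ("M", 1980, "079", "diabetes"),
]

# Group the sensitive values by quasi-identifier
groups = defaultdict(list)
for gender, year, zip3, diagnosis in rows:
    groups[(gender, year, zip3)].append(diagnosis)

# k: size of the smallest group sharing a quasi-identifier
k = min(len(g) for g in groups.values())
# l: smallest number of distinct sensitive values within any group
l = min(len(set(g)) for g in groups.values())

print(f"table is {k}-anonymous with l-diversity {l}")
```

Here the table is 2-anonymous, yet l is 1: everyone in the second group shares the same diagnosis, so knowing someone is in that group reveals the sensitive attribute.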


Data Mining and Ethics

Privacy is not the only issue: data mining brings up ethical issues as well
Can you use sexual and/or racial information for profiling?
Medical diagnosis?
Loan payments?
What about proxies for these things?

Best practices:

Full disclosure
Full transparency
Limited access to data
Opt-out
But: can we use data for the public good without informing
everyone?
