
Topic 1: Introduction To Data Mining: Instructor: Chris Volinsky

This document provides an introduction to a data mining course. It outlines the course objectives, which include learning data mining techniques and their applications, understanding the limitations of standard statistical techniques, and implementing data mining models using statistical software. The document also describes the course assignments, which include homework, exams, and a semester-long data mining project where students define a question, collect and analyze data, and write a report on their findings. Key data mining software and resources are also introduced.

Uploaded by

Salom

Topic 1:

Introduction to Data Mining


Instructor: Chris Volinsky

Data Mining, Columbia University

Intro
Who am I?

Who are you?


Class Schedule
Sept 8 to December 8
No class Election Day or Thanksgiving

Syllabus:
www.research.att.com/~volinsky/DataMining/Columbia2011/Columbia2011.html
My email: [email protected]
My phone: 973-360-8644
My office hours: by appointment before or after class

Class Assessment
30% HW
Due every two weeks
1st HW due next Thursday September 15
No late HW accepted

40% Tests
Midterm and Final

30% Data Mining Project


Proposal due in October
Project due Tuesday Dec 13

Course Objectives
Direct Objectives:
To learn data mining techniques
To see their use in real-world/research applications
To understand limitations of standard statistical
techniques in data mining applications
To get an understanding of the methodological
principles behind data mining
To be able to read about data mining in the popular
press with a critical eye
To implement & use data mining models using
statistical software


Data Analysis Project


The goal of data mining is to find interesting patterns
in data. You will be required to:

Define a scientific question of interest


Collect a data set with n > 1000 (probably online)
Prepare the data set properly
Analyze the data using appropriate models
Write a 10-20 page report on your analysis (graphics
included)

Project proposals (1/2 to 1 page) will be due in early October.
Volunteers to present projects in class for extra
credit.
Finished reports will be due December 13.

Data Mining Software


Software
Can use any software you like, but you must know how to input,
manipulate, graph, and analyze data.
Preferred: R
Also: SAS, Weka, SPSS, Systat, Enterprise Miner, JMP, Minitab,
Matlab, SQL Server
Maybe not: Excel, C

What is R?

Open source statistical software grown out of S/Splus


www.r-project.org
Many user-contributed packages at CRAN (cran.r-project.org)
Active, helpful user community (help lists, bulletin boards, etc)
R Tutorials available online (see class website and CRAN)
Great graphics (with a bit of a learning curve)

Other useful tools: Perl/Python, AWK, Shell scripts


Resources
Data mining is a new field and, as such, does not have
authoritative texts (yet).
This class draws from many sources. The best are:
Elements of Statistical Learning - Hastie, Tibshirani, and Friedman
Handbook of Data Mining - Hand, Mannila and Smyth
Interactive and Dynamic Graphics for Data Analysis - Cook and Swayne
Data Mining: Practical Machine Learning Tools and Techniques - Witten and Frank
Also good class notes available from other classes:

David Madigan, Columbia


Di Cook, Iowa State
Padhraic Smyth, UC Irvine
Jiawei Han, Simon Fraser

(see class web site for pointers to these notes, or just Google them!)

Also a few good books which teach stats/DM through R:

The R Book - Crawley
A Handbook of Statistical Analyses Using R - Everitt and Hothorn
Modern Applied Statistics Using S-Plus - Venables and Ripley


Course Outline
Each unit covers two lectures
Units:

Intro to Data Mining


Data exploration and visualization
Data Mining Concepts
Regression Topics
Classification and Supervised Learning
Clustering and Unsupervised Learning
Text Mining and Information Retrieval
Web Mining
Social Networks
Assorted Topics
Advanced Classification: neural networks, support vector machines
Ensemble methods
Recommender Systems
Fraud


What is Data Mining?


Not well defined.
No one can agree on what data mining is! In fact,
the experts have very different descriptions:
"finding interesting structure (patterns, statistical models,
relationships) in data bases." - Fayyad, Chaudhuri, et al.
"the nontrivial process of identifying valid, novel, potentially
useful, and ultimately understandable patterns in data." - Fayyad
"a knowledge discovery process of extracting previously
unknown, actionable information from very large data bases." - Zorne
"a process that uses a variety of data analysis tools to discover
patterns and relationships in data that may be used to make
valid predictions." - Edelstein


What is Data Mining


From Zaiane:
Data Mining, also popularly known as Knowledge Discovery in
Databases (KDD)...
The Knowledge Discovery in Databases process comprises of a
few steps leading from raw data collections to some form of new
knowledge. The iterative process consists of the following steps:

Data cleaning: ...


Data integration: ...
Data selection: ...
Data transformation: ...
Data mining: it is the crucial step in which clever techniques are applied
to extract patterns potentially useful.
Pattern evaluation: ...
Knowledge representation: ...


What is Data Mining?


What does the authority say?
Data mining is the process of extracting hidden patterns from
data.
Data mining is the process of discovering new patterns from
large data sets involving methods from statistics and artificial
intelligence but also database management.

Hand, Mannila, Smyth:


data mining is the analysis of (often large) observational data sets to find
unsuspected relationships and to summarize the data in novel ways that are
both understandable and useful to the data owner

Isn't that the same as statistics?


Data Mining vs. Statistics


Snark: Data Mining = Statistics + Marketing
Statistics is known for:
well defined hypotheses used to learn about a
specifically chosen population studied using
carefully collected data providing inferences with
well known properties.

Data mining isn't that careful. It is:
data driven discovery of
models and patterns from
massive and
observational data sets

Data Mining v. Statistics


Traditional statistics

first hypothesize, then collect data, then analyze


often model-oriented (strong parametric models)
Focused on understanding

Data mining (also Machine Learning):

few if any a priori hypotheses


data is usually already collected a priori
analysis is typically data-driven not hypothesis-driven
Often algorithm-oriented rather than model-oriented
Focused on prediction

But

statistical ideas are very useful in data mining, e.g., in validating whether
discovered knowledge is useful
Increasing overlap at the boundary of statistics and DM
Cultures could learn from each other
Very powerful when used together


Data Mining Enablers


Explosion of data
Fast and cheap computation and storage
Moore's Law: processing power doubles every two years
Disk storage doubles every 9 months
Database technology

Competitive pressure in business


Data has value! Successes are widely publicized

Commercial products
SAS, SPSS, Google Analytics, IBM, Oracle

Open Source products


Weka
R

Don't need a data mining expert to do data
mining!

Data-Driven Discovery
Observational data
cheap relative to experimental data
Examples:
Retail stores, airlines, etc
Amazon, Google, etc
Do iPhone users use more data than Android users?

Makes sense to leverage available, observational data

What are the perils of observational data?


Easy to do pseudo-experiments
Observational data can also help in
hypothesis formulation.

Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, surrounded by Database Technology, Machine Learning, Statistics, Information Science, Visualization, and Other Disciplines]

Different fields have different views of what data mining is
(also different terminology!)

Data Data Data


It's all about the data - where does it
come from?

www
NASA
Business processes/transactions
Telecommunications and networking
Medical imagery
Government, census, demographics
(data.gov!)
Sensor networks, RFID tags
sports

Types of Data: Flat File or Vector Data

[Figure: data matrix with n rows and p columns]
Rows = objects
Columns = measurements on objects

Represent each row as a p-dimensional vector, where p is the
dimensionality
In effect, embed our objects in a p-dimensional vector space

Both n and p can be very large in data mining (also p >> n)
Matrix can be quite sparse

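Types of Data: Text Data

Can be represented as a sparse matrix

[Figure: document-term matrix with Text Documents as rows and Word IDs as columns; the labels "Obama" and "The Help" appear in the figure]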
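
As a rough illustration of the sparse representation mentioned above, the Python sketch below turns a few documents into a document-term matrix that stores only nonzero counts. The toy documents, the word-ID scheme (IDs assigned in order of first appearance), and the dict-per-row layout are all assumptions made for this example, not the course's method.

```python
from collections import Counter

# Hypothetical toy documents (not from the slides).
documents = [
    "data mining finds patterns in data",
    "statistics and data mining overlap",
    "movies and books",
]

# Assign each distinct word an integer ID in order of first appearance.
vocabulary = {}
for doc in documents:
    for word in doc.split():
        vocabulary.setdefault(word, len(vocabulary))

# Sparse document-term matrix: one {word_id: count} dict per document,
# holding only the nonzero entries of that row.
sparse_rows = []
for doc in documents:
    counts = Counter(vocabulary[word] for word in doc.split())
    sparse_rows.append(dict(counts))

n_docs, n_words = len(documents), len(vocabulary)
nonzero = sum(len(row) for row in sparse_rows)
print(f"{n_docs} documents x {n_words} words, {nonzero} nonzero entries")
for row in sparse_rows:
    print(row)
```

Even in this tiny case fewer than half of the matrix cells are nonzero; with a realistic vocabulary the fraction is far smaller, which is why the sparse form matters.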

Transactional Data
Date-stamped events (web logs, phone calls):
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,

Can be represented as a time series:
[Figure: event sequences over time for User 1 through User 5]
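
A minimal Python sketch of how log lines like those above could be grouped into one event sequence per user. It assumes that the first comma-separated field is the client address, the fourth is the time of day, and the fourteenth is the requested path; that reading of the fields is an inference from how the lines look, not something the slides state.

```python
from collections import defaultdict

# Four of the log lines shown above, copied verbatim.
log_lines = [
    "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
    "128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,",
    "128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,",
    "128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,",
]

# Group events by client so each user becomes one time-ordered sequence.
events_by_user = defaultdict(list)
for line in log_lines:
    fields = [field.strip() for field in line.split(",")]
    # Assumed field positions: 0 = client address, 3 = time, 13 = path.
    client, time_of_day, path = fields[0], fields[3], fields[13]
    events_by_user[client].append((time_of_day, path))

for client, events in events_by_user.items():
    print(client, events)
```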


Types of Data: Relational Data
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
,

128.195.36.195, Doe, John, 12 Main St, 973-462-3421, Madison, NJ, 07932


114.12.12.25,Trank, Jill, 11 Elm St, 998-555-5675, Chester, NJ, 07911

07911, Chester, NJ, 07954, 34000, , 40.65, -74.12


07932, Madison, NJ, 56000, 40.642, -74.132

Most large data sets are stored in relational databases
Special data query language: SQL
Oracle, MSFT, IBM
Good open source versions: MySQL, PostGres
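
To make the relational idea concrete, here is a small Python sketch using the standard-library sqlite3 module. It loads a few of the rows shown above into three tables and joins them with SQL. The table names, column names, and the choice of SQLite are assumptions for illustration; the slides do not specify a schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: table and column names are invented for this sketch.
conn.executescript("""
    CREATE TABLE hits (client_ip TEXT, hit_time TEXT, path TEXT);
    CREATE TABLE customers (client_ip TEXT, last_name TEXT, first_name TEXT, zip TEXT);
    CREATE TABLE zips (zip TEXT, city TEXT, state TEXT);
""")

conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", [
    ("128.195.36.195", "10:35:11", "/top.html"),
    ("128.195.36.195", "10:35:16", "/spt/main.html"),
    ("128.200.39.17", "20:55:07", "/spt/images/bk1.jpg"),
])
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    ("128.195.36.195", "Doe", "John", "07932"),
    ("114.12.12.25", "Trank", "Jill", "07911"),
])
conn.executemany("INSERT INTO zips VALUES (?, ?, ?)", [
    ("07911", "Chester", "NJ"),
    ("07932", "Madison", "NJ"),
])

# Link web hits to customers by address, then customers to cities by zip code.
query = """
    SELECT c.first_name, c.last_name, z.city, COUNT(*) AS n_hits
    FROM hits AS h
    JOIN customers AS c ON c.client_ip = h.client_ip
    JOIN zips AS z ON z.zip = c.zip
    GROUP BY c.first_name, c.last_name, z.city
    ORDER BY n_hits DESC
"""
rows = conn.execute(query).fetchall()
print(rows)
conn.close()
```

The hit from 128.200.39.17 drops out of the result because no customer row matches it, which is the default behavior of an inner join.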

Types of Data: Time Series Data


Often many time series, long time series,
or multivariate time series


Time Series: Ebay Data

Jank, Shmueli, et al. (2005)


Types of Data: Image Data


Spatio-Temporal Data

http://senseable.mit.edu/nyte/movies/nyte-globe-encounters.mov


Network Data: Physical Network


Network Data: Derived Social Network

"Algorithms for estimating relative importance in networks,"
S. White and P. Smyth, ACM SIGKDD, 2003.

Social Network: Real Social Network

HP Labs email network
500 people, 20k relationships


Examples of Data Mining Successes
Market Basket (WalMart)
Recommender Systems (Amazon.com)
Fraud Detection in Telecommunications
(AT&T)
Target Marketing / CRM
Financial Markets
DNA Microarray analysis (or is it?)
Web Traffic / Blog analysis


Examples of Data Mining Successes

Google is a company built on data mining
PageRank mined the web to build better search
Google as spell checker
Google as ad placer
Google as news aggregator
Google as face recognizer

The Data Mining Process


Often called KDD - Knowledge Discovery in Databases
Analysis is just one part of the process

Data collection and storage


Data cleaning
Data sampling
Analysis
Decision making


Different Data Mining Tasks


Exploratory Data Analysis
Descriptive Modeling
Predictive Modeling
Discovering Patterns and Rules
+ others.

Exploratory Data Analysis


Before you model what do you do?
Must check your data

Compute summary statistics: range, max, min, mean,
median, variance, skewness, ...

Look for missing values, outliers, skewness, etc.


What types of variables do you have?

Visualization is widely used


1d histograms
2d scatter plots
Higher-dimensional methods

Simple exploratory analysis can be extremely valuable

Always look at your data before applying any data
mining algorithms
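
The checks listed above can be sketched in a few lines of Python. The sample values, the use of None to mark missing entries, and the two-standard-deviation outlier rule are assumptions chosen for illustration; they are one simple way to start, not a prescribed procedure.

```python
import statistics

# Hypothetical measurements; None marks a missing value (an assumption for this sketch).
values = [12.0, 15.5, None, 14.2, 13.8, 98.0, 14.9, None, 13.1]

observed = [v for v in values if v is not None]
n_missing = len(values) - len(observed)

summary = {
    "n": len(observed),
    "missing": n_missing,
    "min": min(observed),
    "max": max(observed),
    "range": max(observed) - min(observed),
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "variance": statistics.variance(observed),  # sample variance
}

# Flag points more than 2 sample standard deviations from the mean.
sd = statistics.stdev(observed)
outliers = [v for v in observed if abs(v - summary["mean"]) > 2 * sd]

for name, value in summary.items():
    print(f"{name:>8}: {value}")
print("possible outliers:", outliers)
```

Note how far the mean sits from the median here: a single extreme value is enough to cause that, which is exactly the kind of thing a quick look at the data reveals before any modeling.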

Example of Exploratory Data Analysis


Languages of the World Wide Web - Google Research Blog, July 2011


Descriptive Modeling
Goal is to build a descriptive model
e.g., a model that could simulate the data if
needed
models the underlying process

Examples:
Density estimation:
estimate the joint distribution P(x1, ..., xp)

Cluster analysis:
Find natural groups in the data

Dependency models among the p variables


Learning a Bayesian network for the data
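
As one small example of descriptive modeling, the Python sketch below finds two groups in one-dimensional data with a bare-bones k-means loop. The data points, the choice of k = 2, the starting centers, and the fixed iteration count are all assumptions for illustration; k-means is only one of many clustering methods.

```python
# Hypothetical one-dimensional measurements with two visible groups.
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]

# Assumed starting centers: one point taken from each end of the data.
centers = [points[0], points[3]]

for _ in range(10):  # a fixed, small number of iterations is enough here
    # Assignment step: put each point in the cluster of its nearest center.
    clusters = [[], []]
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
        clusters[nearest].append(p)
    # Update step: move each center to the mean of its cluster.
    centers = [sum(cluster) / len(cluster) for cluster in clusters]

print("centers:", centers)
print("clusters:", clusters)
```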

Example of Descriptive Modeling


Hemoglobin vs. cell volume

Control Group

Anemia Group


Example of Descriptive Modeling

Control Group

Anemia Group


Predictive Modeling
Predict one variable Y given a set of other variables X
Here X could be a p-dimensional vector
Classification: Y is categorical
Regression: Y is real-valued

In effect this is function approximation: learning the
relationship between Y and X

In data mining, the emphasis is on predictive accuracy,
not on understanding the model
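
The two flavors of predictive modeling can be shown side by side in a short Python sketch. The regression part fits a straight line by ordinary least squares; the classification part uses a one-nearest-neighbor rule. Both data sets and both model choices are assumptions picked for brevity, not methods the slides prescribe.

```python
# Regression: Y is real-valued. Fit y = intercept + slope * x by least squares.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # hypothetical inputs
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # hypothetical responses

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)
slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
         / sum((x - x_mean) ** 2 for x in xs))
intercept = y_mean - slope * x_mean

def predict_value(x):
    """Predict a real-valued Y from X using the fitted line."""
    return intercept + slope * x

# Classification: Y is categorical. Predict the label of the nearest training point.
labeled = [(1.0, "low"), (2.0, "low"), (8.0, "high"), (9.0, "high")]  # hypothetical

def predict_label(x):
    """Predict a categorical Y from X with a one-nearest-neighbor rule."""
    _, nearest_label = min(labeled, key=lambda pair: abs(pair[0] - x))
    return nearest_label

print("slope:", slope, "intercept:", intercept)
print("predicted value at x=6:", predict_value(6.0))
print("predicted label at x=7:", predict_label(7.0))
```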


Predictive Modeling: Fraud Detection
Telecommunications fraud detection

Fraud costs companies US$ Billions per year


very few transactions are fraudulent, but they are costly

Approach

For each transaction estimate fraudiness.


Based on known fraud AND known user behavior
High probability cases investigated by fraud police

Example models:

Credit card usage profiling

anomaly detection
guilt by association


Pattern Discovery
Goal is to discover interesting local patterns in
the data rather than to characterize the data
globally
Given market basket data we might discover that
"If customers buy wine and bread then they buy cheese with
probability 0.9"
These are known as association rules
This was how data mining was born.
But I don't like it

Other examples:
Astronomy
Finance
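
The "probability" attached to an association rule is usually called its confidence: among baskets containing the items on the left-hand side, the fraction that also contain the right-hand side. A minimal Python sketch, using made-up baskets (so the number that comes out is not the 0.9 of the wine-and-bread example):

```python
# Hypothetical market baskets; each basket is the set of items in one purchase.
baskets = [
    {"wine", "bread", "cheese"},
    {"wine", "bread", "cheese", "olives"},
    {"wine", "bread"},
    {"bread", "milk"},
    {"wine", "bread", "cheese"},
    {"milk", "cheese"},
]

def support(itemset):
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent in basket | antecedent in basket)."""
    return support(antecedent | consequent) / support(antecedent)

rule_support = support({"wine", "bread", "cheese"})
rule_confidence = confidence({"wine", "bread"}, {"cheese"})
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

Rule-mining algorithms search over many candidate itemsets and keep the rules whose support and confidence clear chosen thresholds; this sketch only evaluates a single rule.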


Example of Pattern Discovery


IBM Advanced Scout System
Bhandari et al. (1997)
Every NBA basketball game is annotated,
e.g., time = 6 mins, 32 seconds
event = 3 point basket
player = Michael Jordan
This creates a huge untapped database of information

IBM algorithms search for rules of the form


"If player A is in the game, player B's scoring rate
increases from 3.2 points per quarter to 8.7 points
per quarter"


Data Mining Pitfalls


Is data mining always necessary?
Just because you have a terabyte doesn't
mean you need to use it.

Privacy concerns
Differ by country, industry, application,
generation

Meaningfulness of patterns unclear


Rhine paradox
Terrorism
DM has a lot to learn from statistics!

Rhine Paradox
Joseph Rhine: parapsychologist who studied
ESP (he was a believer!)
He devised an experiment where subjects were
asked to guess 10 hidden cards --- red or blue.
Reported: 1 in 1000 people have ESP
He told these people they had ESP and called
them in for another test of the same type.
What do you think happened?
What is the conclusion?
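
A quick calculation helps with these questions. The Python sketch below assumes that nobody has ESP and that each red-or-blue guess is an independent fair coin flip; the number of subjects is hypothetical.

```python
from fractions import Fraction

n_cards = 10
# Chance that pure guessing gets all 10 red/blue cards right.
p_all_correct = Fraction(1, 2) ** n_cards

n_subjects = 10_000  # hypothetical number of people tested
# Expected number of people who look like they "have ESP" by luck alone.
expected_lucky = n_subjects * p_all_correct

# Chance that one lucky guesser also gets every card right on a second test.
p_repeat = p_all_correct

print("P(all correct by chance) =", p_all_correct)
print("expected lucky subjects  =", float(expected_lucky))
print("P(repeat on retest)      =", p_repeat)
```

Under these assumptions, about 1 person in 1,000 passes the first test by chance, which matches the reported rate, and almost none of them would be expected to pass again.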


Data Mining Pitfalls


PR Problems: data mining as a four letter word?
"...increasingly people's data is at risk. The old ways ...are still at
use like dumpster diving, stealing from mailboxes, physical theft,
and credit card receipt copying. New tactics include disparate
techniques of phishing, email fraud, data mining, spam, key-logging
and an array of other technological processes." - Steven D.
Domenikos, IdentityTruth, 2008

"One place oversight is sorely lacking is in the whole matter of data
mining. ...What have they contributed? Not a single case comes to
mind in which security services apprehended a terrorist following
identification by data mining. ...that huge database will be out
there, win or lose, for some government agency to divert to its
purposes or some hacker to turn to private gain or crime." - John
Prados, TomPaine.com


Fighting Terrorism in the US


US Government is widely known to be collecting lots of
data on Americans and using data mining to look for
patterns consistent with terrorist activity.
Bruce Schneier, Wired Magazine, "Why Data Mining Won't
Stop Terror":
Assume:
1 in 100 false positive rate (99% accurate)
1 in 1000 false negative rate
1 trillion events (phone calls, credit card transactions, emails)
per year
10 are really terrorist plots

Then:
1 billion false alarms for every true plot uncovered
27 million leads daily
Even at 99.9999% accuracy: about 2,750 false alarms per day
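
The figures above can be checked with a few lines of Python. This sketch assumes the 1 trillion events are an annual total, since that is the reading under which the 27 million daily leads come out; it also ignores the false negative rate, which barely changes the counts.

```python
events_per_year = 1_000_000_000_000  # 1 trillion events, assumed to be per year
real_plots = 10

# 1 in 100 false positive rate.
false_alarms_per_year = events_per_year // 100
false_alarms_per_plot = false_alarms_per_year // real_plots
leads_per_day = false_alarms_per_year / 365

# Improve the false positive rate to 1 in a million (99.9999% accurate).
improved_alarms_per_year = events_per_year // 1_000_000
improved_alarms_per_day = improved_alarms_per_year / 365

print(f"false alarms per true plot: {false_alarms_per_plot:,}")
print(f"leads per day:              {leads_per_day:,.0f}")
print(f"per day at 99.9999%:        {improved_alarms_per_day:,.0f}")
```

The last figure comes out near 2,740 per day, in line with the roughly 2,750 quoted above.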

Data Mining v. Privacy


There is often tension between data
mining and personal privacy:
http://www.aclu.org/pizza/images/screen.swf

Now, some case studies.



Risk v. Reward in Data Mining


More data about more people in fewer places


The risks of research


My own personal story:
or, how a paper published in JCGS leads
me to be connected to FBI wiretapping.

2001-2005: Publish papers on "Communities of Interest" using
social networks and "Guilt by association" to catch fraud
9 September 2007: NYT lead story "F.B.I. Data Mining Reached
Beyond Initial Targets" discusses FBI techniques COI and GBA
23 October 2007: Blogosphere erupts: "How AT&T Provides the FBI
with Terror Suspect Leads"

The Good, The Bad, and the Maybe


The question remains: how do we
effectively leverage sensitive personal
data for research purposes?
Three case studies can give insight
Netflix Prize
AOL search dataset
Barabasi mobile study


Case Study 1: AOL Search Data


August 4, 2006: AOL releases 20M search
terms by anonymized users for research
purposes.
Why?

Within hours, uproar on the blogs
"The utter stupidity of this is staggering" - TechCrunch

August 7: AOL removes data, issues apology
"this was a screw-up, and we are angry"
"an innocent enough attempt to reach out to
the research community"

August 9: NYT front page story
Identifies Thelma Arnold, 62 year old widow


Case Study 1: AOL Search Data


What's the big deal?
Ego searches make it easy to figure out who you are. Combined with
porn or illegal queries, this can make for serious privacy violations.

What went wrong

Not well thought out : risk >> reward


Poor internal controls on public data release
Lack of understanding of subject matter
Lack of understanding of anonymizing data

Fallout
CTO + at least two others fired
Data still out in the public
Is it ethical to study?

Inspiration for bad drama


"purple lilac," "happy bunny pictures," "square dancing steps,"
"cut into your trachea," "pee fetish," "Simpsons incest."

Case Study 2: Netflix Prize


October 2006: Netflix releases anonymized
movie ratings from its customer base
100M ratings, 500K customers (<10% of all
data)
Random integer as user ID
"some of the rating data for some customers in the
training and qualifying sets have been deliberately
perturbed in one or more of the following ways:
deleting ratings; inserting alternative ratings and
dates; and modifying rating dates"

2007: Shock paper claiming deanonymization of Netflix Prize data



Case Study 2: Netflix Prize


Narayanan and Shmatikov (2008)
The adversary with a small amount of background
knowledge about an individual can identify with high
probability that individual's record in the data and
learn sensitive attributes
Claim that Netflix data sanitization is not relevant
Accuse Netflix of violating the Video Privacy Protection Act
of 1988
Details:
With aux info on 8 movies, where 2 can be wrong, and dates are
known within 14 days; 99% de-anonymization

Aux info can be gotten via web sites, water coolers, etc
People might be willing to give away some ratings, but not
others


Case Study 2: Netflix Prize


Much ado about nothing
Although paper is technically correct, dates are key
Without dates, you must know 8 movies, all outside of the top
500 to get over 80% chance of de-anonymization
Auxiliary data very hard to come by
No known cases discovered

Netflix did it right


Consulted with top machine learning experts
0 < risk << reward
Investment in quality data and expertise
mitigated risk

Case Study 3: Barabasi Mobile Study

Gonzalez, Hidalgo and Barabasi (2008)
Article in Nature outlines study on human mobility patterns

100,000 individuals selected randomly from dataset of 6 million


Unidentified country (unclear if the researchers knew)
Cell tower location at start of call
206 individuals were pinged every two hours for a week

Findings
humans follow simple, reproducible patterns
Sample finding: Nearly three-quarters of those studied mainly
stayed within a 20-mile-wide circle for half a year.
Results could impact all phenomena driven by human
mobility, from epidemic prevention to emergency response
and urban planning.


Case Study 3: Barabasi Mobile Study
Uproar ensued over secret tracking of cell phone users
Blowback of negative feedback to Nature and scientists
Study would be illegal in the US
Approval from ONR review board and Northeastern review
board. Barabasi did not check with an ethics panel

Response
Hidalgo: "the data could be misused, but we were not
trying to do evil things. We are trying to make the world a
little better."
Northeastern and Nature backed the research
Continues to be referenced as an example of dangerous
research
Risk and reward both very high


Research Concepts - Privacy


How do we guarantee that data is private?
Quasi-identifiers: combinations of attributes within the data that can be
used to identify individuals.
E.g. 87% of the population of the United States can be uniquely identified by
gender, date of birth, and 5-digit zip code
Datasets are k-anonymous when, for any given quasi-identifier, a record is
indistinguishable from k-1 others.

But, one step further, maybe all k have a given sensitive attribute!
The diversity of sensitive values within such a group is what l-diversity measures.

Ways to fuzz data to increase anonymity and diversity:


Generalize / summarize the data : bin size, aggregate counts
Suppress or delete data
Perturb data

Balance between privacy and utility is a hot research topic
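
A small Python sketch of how k-anonymity and l-diversity could be checked on a table. The records, the choice of quasi-identifier columns (gender, birth year, 3-digit zip prefix), and the use of the simplest "number of distinct sensitive values" notion of l-diversity are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical, already-generalized records:
# (gender, birth_year, zip_prefix, diagnosis), with diagnosis as the sensitive attribute.
records = [
    ("F", 1975, "079", "flu"),
    ("F", 1975, "079", "asthma"),
    ("F", 1975, "079", "flu"),
    ("M", 1982, "100", "diabetes"),
    ("M", 1982, "100", "diabetes"),
]

# Group the sensitive values by quasi-identifier combination.
groups = defaultdict(list)
for gender, birth_year, zip_prefix, diagnosis in records:
    groups[(gender, birth_year, zip_prefix)].append(diagnosis)

# k: size of the smallest group of records sharing a quasi-identifier.
k = min(len(values) for values in groups.values())
# l: smallest number of distinct sensitive values within any group.
l_diversity = min(len(set(values)) for values in groups.values())

print(f"the table is {k}-anonymous with l-diversity {l_diversity}")
```

In this toy table every record hides among at least one other (k = 2), yet anyone known to be in the second group is revealed to have diabetes (l = 1), which is the "one step further" problem described above.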


Data Mining and Ethics


Privacy is not the only issue: data mining brings up
ethical issues as well
Can you use sexual and/or racial information for profiling?
Medical diagnosis?
Loan payments?
What about proxies for these things?

Best practices:

Full disclosure
Full transparency
Limited access to data
Opt-out
But: can we use data for the public good without informing
everyone?
