Topic 1: Introduction To Data Mining: Instructor: Chris Volinsky
Data Mining, Columbia University
Intro
Who am I?
Class Schedule
September 8 to December 8
No class Election Day or Thanksgiving
Syllabus:
www.research.att.com/~volinsky/DataMining/Columbia2011/Columbia2011.html
My email: [email protected]
My phone: 973-360-8644
My office hours: by appointment before or after class
Class Assessment
30% HW
Due every two weeks
1st HW due next Thursday September 15
No late HW accepted
40% Tests
Midterm and Final
Course Objectives
Direct Objectives:
To learn data mining techniques
To see their use in real-world/research applications
To understand limitations of standard statistical
techniques in data mining applications
To get an understanding of the methodological
principles behind data mining
To be able to read about data mining in the popular
press with a critical eye
To implement & use data mining models using
statistical software
What is R?
Resources
Data mining is a new field and, as such, does not have
authoritative texts (yet).
This class draws from many sources; the best are:
The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman)
Principles of Data Mining (Hand, Mannila, and Smyth)
Interactive and Dynamic Graphics for Data Analysis (Cook and Swayne)
Data Mining: Practical Machine Learning Tools and Techniques (Witten and Frank)
Good class notes are also available from other classes
(see the class web site for pointers to these notes, or just Google them!)
Course Outline
Each unit covers two lectures
Units:
But statistical ideas are very useful in data mining, e.g., in validating whether
discovered knowledge is useful
Increasing overlap at the boundary of statistics and DM
Cultures could learn from each other
Very powerful when used together
Commercial products
SAS, SPSS, Google Analytics, IBM, Oracle
Data-Driven Discovery
Observational data
cheap relative to experimental data
Examples:
Retail stores, airlines, etc
Amazon, Google, etc
Do iPhone users use more data than Android users?
[Figure: Data Mining in relation to Machine Learning, Statistics, Information Science, Visualization, and Other Disciplines]
www
NASA
Business processes/transactions
Telecommunications and networking
Medical imagery
Government, census, demographics (data.gov!)
Sensor networks, RFID tags
Sports
Data matrix: n rows by p columns
Rows = objects
Columns = measurements on objects
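The row/column convention can be made concrete with a small sketch. The table below is entirely made up for illustration (the column names and values are hypothetical, not course data); it only shows how n and p relate to rows and columns.

```python
# Hypothetical toy data: each row is one object (a customer),
# each column is one measurement taken on that object.
columns = ["age", "income", "num_purchases"]  # made-up measurement names
data = [
    [34, 52000.0, 12],
    [45, 61000.0, 3],
    [23, 28000.0, 7],
    [51, 75000.0, 1],
]

n = len(data)     # number of rows (objects)
p = len(data[0])  # number of columns (measurements)

# A row holds everything measured on one object.
first_object = data[0]

# A column holds one measurement across all objects.
ages = [row[0] for row in data]


def column_mean(matrix, j):
    """Average of measurement j over all objects."""
    return sum(row[j] for row in matrix) / len(matrix)


print(n, p, column_mean(data, 0))
```

Most techniques in the course can be read as operations on such an n by p table, either across rows (comparing objects) or down columns (summarizing measurements).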
Text Documents
Can be represented as a sparse matrix
Axes: Text Documents by Word IDs
Labels in figure: Obama, The Help
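A minimal sketch of the sparse representation, assuming rows are documents and columns are word IDs. The three short documents are invented for illustration; only the nonzero counts are stored for each document.

```python
from collections import Counter

# Made-up example documents (not from the course data).
documents = [
    "obama gives speech on economy",
    "the help tops the box office",
    "obama comments on the help",
]

# Assign an integer word ID to each distinct word, in order of first appearance.
word_ids = {}
for doc in documents:
    for word in doc.split():
        word_ids.setdefault(word, len(word_ids))

# Sparse matrix: one dict per document, mapping word ID -> count.
# Words that do not occur in a document are simply absent (implicit zeros).
sparse_rows = []
for doc in documents:
    counts = Counter(word_ids[word] for word in doc.split())
    sparse_rows.append(dict(counts))


def density(rows, n_cols):
    """Fraction of matrix cells that are nonzero."""
    nonzero = sum(len(row) for row in rows)
    return nonzero / (len(rows) * n_cols)


print(len(word_ids), density(sparse_rows, len(word_ids)))
```

With a realistic vocabulary of tens of thousands of words, almost every cell is zero, which is why storing only the nonzero entries matters.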
Transactional Data
Date-stamped events (web logs, phone calls):
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,
128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.101, -, 3/22/00, 16:18:59, W3SVC, SRVR1, 128.200.39.181, 0, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:54:37, W3SVC, SRVR1, 128.200.39.181, 140, 199, 875, 200, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 17766, 365, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:54:55, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:39, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:03, W3SVC, SRVR1, 128.200.39.181, 1081, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:56:04, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:56:33, W3SVC, SRVR1, 128.200.39.181, 0, 262, 72, 304, 0, GET, /top.html, -,
128.200.39.17, -, 3/22/00, 20:56:52, W3SVC, SRVR1, 128.200.39.181, 19598, 382, 414, 200, 0, POST, /spt/main.html, -,
Can be represented as a time series:
[Figure: page-request sequences over time for User 1 through User 5, with pages coded by numeric ID]
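One way to turn date-stamped log lines into per-user sequences is sketched below. It reuses five of the log lines shown above. The field positions (client address first, date and time in the third and fourth fields, URL in the fourteenth) are read off the sample and are an assumption, and the page IDs are assigned here in order of first appearance, which need not match the coding used in the figure.

```python
from collections import defaultdict
from datetime import datetime

log_lines = [
    "128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,",
    "128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,",
    "128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,",
    "128.195.36.101, -, 3/22/00, 16:18:50, W3SVC, SRVR1, 128.200.39.181, 60, 425, 72, 304, 0, GET, /top.html, -,",
    "128.195.36.101, -, 3/22/00, 16:18:58, W3SVC, SRVR1, 128.200.39.181, 8322, 527, 414, 200, 0, POST, /spt/main.html, -,",
]


def parse_line(line):
    """Return (client, timestamp, url) for one comma-separated log line."""
    fields = [field.strip() for field in line.split(",")]
    # Assumed layout: field 0 = client address, 2 = date, 3 = time, 13 = URL.
    timestamp = datetime.strptime(fields[2] + " " + fields[3], "%m/%d/%y %H:%M:%S")
    return fields[0], timestamp, fields[13]


# Order all events by time, then build one page-ID sequence per client.
events = sorted((parse_line(line) for line in log_lines), key=lambda e: e[1])

page_ids = {}                   # URL -> integer page ID (hypothetical coding)
sequences = defaultdict(list)   # client address -> ordered list of page IDs
for client, timestamp, url in events:
    page_id = page_ids.setdefault(url, len(page_ids) + 1)
    sequences[client].append(page_id)

for client, seq in sequences.items():
    print(client, seq)
```

Treating the client address as the "user" is itself a simplification, since several people can share one address.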
Types of Data: Relational Data
128.200.39.17, -, 3/22/00, 20:55:07, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 1061, 382, 414, 200, 0, POST, /spt/main.html, -,
128.200.39.17, -, 3/22/00, 20:55:36, W3SVC, SRVR1, 128.200.39.181, 0, 258, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
128.195.36.195, -, 3/22/00, 10:35:11, W3SVC, SRVR1, 128.200.39.181, 781, 363, 875, 200, 0, GET, /top.html, -,
128.195.36.195, -, 3/22/00, 10:35:16, W3SVC, SRVR1, 128.200.39.181, 5288, 524, 414, 200, 0, POST, /spt/main.html, -,
128.195.36.195, -, 3/22/00, 10:35:17, W3SVC, SRVR1, 128.200.39.181, 30, 280, 111, 404, 3, GET, /spt/images/bk1.jpg, -,
Most large data sets are stored in relational databases
Special data query language: SQL
Oracle, MSFT, IBM
Good open source versions: MySQL, PostgreSQL
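As a rough illustration of querying relational data with SQL, the sketch below loads a few of the log records above into an in-memory SQLite database (used here only because it ships with Python; it is not one of the systems named above). The table name, column names, and the reading of the 200/404 field as a status code are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE requests ("
    "client_ip TEXT, req_time TEXT, status INTEGER, method TEXT, url TEXT)"
)

# A subset of the fields from the log lines above (hypothetical column names).
rows = [
    ("128.200.39.17", "20:55:07", 404, "GET", "/spt/images/bk1.jpg"),
    ("128.200.39.17", "20:55:36", 200, "POST", "/spt/main.html"),
    ("128.200.39.17", "20:55:36", 404, "GET", "/spt/images/bk1.jpg"),
    ("128.195.36.195", "10:35:11", 200, "GET", "/top.html"),
    ("128.195.36.195", "10:35:16", 200, "POST", "/spt/main.html"),
    ("128.195.36.195", "10:35:17", 404, "GET", "/spt/images/bk1.jpg"),
]
conn.executemany("INSERT INTO requests VALUES (?, ?, ?, ?, ?)", rows)

# SQL query: how many failed (404) requests did each client make?
errors_per_client = conn.execute(
    "SELECT client_ip, COUNT(*) FROM requests "
    "WHERE status = 404 GROUP BY client_ip ORDER BY client_ip"
).fetchall()

print(errors_per_client)
conn.close()
```

The same SELECT ... GROUP BY pattern carries over to the larger database systems, with differences mainly in data loading and dialect details.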
Jank, Shmueli, et al. (2005)
Spatio-Temporal Data
http://senseable.mit.edu/nyte/movies/nyte-globe-encounters.mov
HP Labs email network: 500 people, 20k relationships
Descriptive Modeling
Goal is to build a descriptive model
e.g., a model that could simulate the data if
needed
models the underlying process
Examples:
Density estimation:
estimate the joint distribution P(x_1, ..., x_p)
Cluster analysis:
Find natural groups in the data
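To make "find natural groups" concrete, here is a minimal sketch of k-means, one common clustering method. The slide does not name an algorithm, so k-means is a stand-in chosen for brevity; the points and the starting centers are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centers):
    """Index of the nearest center for each point."""
    labels = []
    for point in points:
        distances = [squared_distance(point, center) for center in centers]
        labels.append(distances.index(min(distances)))
    return labels


def kmeans(points, centers, n_iter=10):
    """Alternate between assigning points and recomputing cluster means."""
    for _ in range(n_iter):
        labels = assign(points, centers)
        new_centers = []
        for k, old_center in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:
                new_centers.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centers.append(old_center)  # keep an empty cluster's center
        centers = new_centers
    return centers, assign(points, centers)


# Hypothetical two-dimensional measurements forming two visible groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, centers=[points[0], points[-1]])
print(centers, labels)
```

The result depends on the starting centers and on the choice of k, both of which are inputs here rather than something the method discovers.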
[Figure: Control Group vs. Anemia Group]
Predictive Modeling
Predict one variable Y given a set of other variables X
Here X could be a p-dimensional vector
Classification: Y is categorical
Regression: Y is real-valued
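The two cases can be sketched side by side. The example below uses a least-squares line for regression and a one-nearest-neighbor rule for classification; both are simple stand-ins picked for illustration, not methods prescribed by the slide, and all data values are invented.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for real-valued Y (regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def nearest_neighbor_label(train_x, train_y, x_new):
    """Predict a categorical Y as the label of the closest training point."""
    def squared_distance(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    best = min(range(len(train_x)), key=lambda i: squared_distance(train_x[i], x_new))
    return train_y[best]


# Regression: Y is real-valued (made-up data lying exactly on y = 2x + 1).
slope, intercept = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
predicted_y = slope * 5.0 + intercept

# Classification: Y is categorical; X is a p-dimensional vector with p = 2.
train_x = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.5, 4.5)]
train_y = ["low", "low", "high", "high"]
predicted_class = nearest_neighbor_label(train_x, train_y, (5.2, 4.9))

print(predicted_y, predicted_class)
```

In both cases the model is judged by how well it predicts Y on data it has not seen, which this toy example does not attempt to measure.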
Approach
Example models:
anomaly detection
guilt by association
Pattern Discovery
Goal is to discover interesting local patterns in
the data rather than to characterize the data
globally
Example: given market basket data, we might discover that
if customers buy wine and bread, then they buy cheese with
probability 0.9
These are known as association rules
This was how data mining was born.
But I don't like it
Other examples:
Astronomy
Finance
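The wine, bread, and cheese rule can be checked on a toy set of baskets. The transactions below are fabricated so that the rule's confidence comes out at 0.9; "support" and "confidence" are the standard association-rule measures, which the slide does not define explicitly.

```python
# Hypothetical market basket data: each basket is the set of items bought together.
baskets = (
    [{"wine", "bread", "cheese"}] * 9
    + [{"wine", "bread"}] * 1
    + [{"bread", "milk"}] * 6
    + [{"cheese", "milk"}] * 4
)


def count_containing(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    return sum(1 for basket in transactions if itemset <= basket)


antecedent = {"wine", "bread"}
consequent = {"cheese"}

n_antecedent = count_containing(antecedent, baskets)
n_both = count_containing(antecedent | consequent, baskets)

# Support: fraction of all baskets containing wine, bread, and cheese.
support = n_both / len(baskets)

# Confidence: among baskets with wine and bread, the fraction that also has cheese.
confidence = n_both / n_antecedent

print(support, confidence)
```

A rule is usually reported only if both measures clear chosen thresholds, since a high-confidence rule that applies to a handful of baskets is rarely interesting.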
Privacy concerns
Differ by country, industry, application,
generation
Rhine Paradox
Joseph (J. B.) Rhine: parapsychologist who studied
ESP (he was a believer!)
He devised an experiment where subjects were
asked to guess 10 hidden cards --- red or blue.
Reported: 1 in 1000 people have ESP
He told these people they had ESP and called
them in for another test of the same type.
What do you think happened?
What is the conclusion?
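The arithmetic behind the paradox can be worked out directly. The sketch below assumes that each guess is an independent 50/50 choice between red and blue, and that "having ESP" meant guessing all 10 cards correctly; the slide does not state the criterion, so that is an assumption, as is the pool of 1000 subjects.

```python
from fractions import Fraction

n_cards = 10
p_one_card = Fraction(1, 2)  # chance of guessing one red/blue card by luck

# Probability of guessing all 10 cards correctly by pure chance.
p_all_correct = p_one_card ** n_cards

# Expected number of perfect scores among a hypothetical 1000 subjects.
n_subjects = 1000
expected_perfect = n_subjects * p_all_correct

# On a retest of the same type, a previously "perfect" subject has the same
# chance of a perfect score as anyone else if only luck is involved.
p_repeat = p_all_correct

print(p_all_correct, float(expected_perfect), float(p_repeat))
```

Under these assumptions, about one person in a thousand scores perfectly with no ESP at all, and almost none of them would be expected to repeat it.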
Then:
1 billion false alarms for every true plot uncovered
27 million leads daily
Even if 99.9999% precision = 2,750 false alarms
Fallout
CTO + at least two others fired
Data still out in the public
Is it ethical to study?
Auxiliary information can be obtained via web sites, water coolers, etc.
People might be willing to give away some ratings, but not
others
Findings
humans follow simple, reproducible patterns
Sample finding: Nearly three-quarters of those studied mainly
stayed within a 20-mile-wide circle for half a year.
Results could impact all phenomena driven by human
mobility, from epidemic prevention to emergency response
and urban planning.
Response
Hidalgo: the data could be misused, but we were not
trying to do evil things. We are trying to make the world a
little better.
Northeastern and Nature backed the research
Continues to be referenced as an example of dangerous
research
Risk and reward both very high
But, one step further, maybe all k have a given sensitive attribute!
The distribution of target values within a group is referred to as l-diversity.
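The problem described here can be shown on a toy table. The sketch below groups records by their quasi-identifiers, reports the smallest group size (k), and reports the smallest number of distinct sensitive values in any group, which is the simplest "distinct values" reading of l-diversity and an assumption on my part. The records, field names, and generalized values are invented.

```python
from collections import defaultdict

# Hypothetical released table: zip and age are generalized quasi-identifiers,
# diagnosis is the sensitive attribute.
records = [
    {"zip": "100**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "100**", "age": "20-29", "diagnosis": "cancer"},
    {"zip": "100**", "age": "20-29", "diagnosis": "flu"},
    {"zip": "112**", "age": "40-49", "diagnosis": "heart disease"},
    {"zip": "112**", "age": "40-49", "diagnosis": "heart disease"},
    {"zip": "112**", "age": "40-49", "diagnosis": "heart disease"},
]


def group_by(records, quasi_identifiers):
    """Group records that share the same quasi-identifier values."""
    groups = defaultdict(list)
    for record in records:
        key = tuple(record[q] for q in quasi_identifiers)
        groups[key].append(record)
    return groups


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of indistinguishable records."""
    groups = group_by(records, quasi_identifiers)
    return min(len(group) for group in groups.values())


def distinct_l_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values within any group."""
    groups = group_by(records, quasi_identifiers)
    return min(len({r[sensitive] for r in group}) for group in groups.values())


k = k_anonymity(records, ("zip", "age"))
l = distinct_l_diversity(records, ("zip", "age"), "diagnosis")
print(k, l)
```

Here every record hides among three look-alikes, yet anyone known to be in the second group is revealed to have heart disease, which is exactly the case where all k share the sensitive attribute.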
Best practices:
Full disclosure
Full transparency
Limited access to data
Opt-out
But: can we use data for the public good without informing
everyone?