Lecture 2
Historical Note:
Many Names of Data Mining
- Data Fishing, Data Dredging: 1960-
  - used by statisticians (as a pejorative)
- Data Mining: 1990-
  - used by the database and business communities
  - in 2003, acquired a bad image because of TIA (Total Information Awareness)
- Knowledge Discovery in Databases: 1989-
  - used by the AI and machine learning communities
- also Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...
Data vs. information
Trends leading to Data Flood
- More data is generated:
  - Bank, telecom, other business transactions ...
  - Scientific data: astronomy, biology, etc.
  - Web, text, and e-commerce
Big Data Examples
- Europe's Very Long Baseline Interferometry (VLBI) has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
  - storage and analysis are a big problem
- AT&T handles billions of calls per day
  - so much data that it cannot all be stored; analysis has to be done "on the fly", on streaming data
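To get a feel for the VLBI figure, the arithmetic can be sketched as below. This is a rough estimate only: it assumes every telescope streams continuously for the whole session and that 1 Gigabit means 10**9 bits; the function name is illustrative.

```python
# Back-of-the-envelope data volume for the VLBI example.
# Assumptions (not stated on the slide): continuous streaming for the
# full session, and decimal units (1 Gigabit = 10**9 bits).
TELESCOPES = 16
BITS_PER_SECOND = 10**9        # 1 Gigabit/second per telescope
SESSION_DAYS = 25
SECONDS_PER_DAY = 24 * 60 * 60


def session_volume_bytes(telescopes, bits_per_second, days):
    """Total bytes produced by all telescopes over one session."""
    total_bits = telescopes * bits_per_second * days * SECONDS_PER_DAY
    return total_bits / 8


total_bytes = session_volume_bytes(TELESCOPES, BITS_PER_SECOND, SESSION_DAYS)
total_petabytes = total_bytes / 1e15
print(f"{total_petabytes:.2f} PB per session")
```

Under these assumptions one session comes to a few petabytes, far beyond the largest 2003 databases, which is why storage and analysis are a problem.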
Largest databases in 2003
- Commercial databases:
  - Winter Corp. 2003 Survey: France Telecom has the largest decision-support DB, ~30 TB; AT&T ~26 TB
- Web
  - Alexa internet archive: 7 years of data, 500 TB
  - Google searches 4+ billion pages, many hundreds of TB
  - IBM WebFountain, 160 TB (2003)
  - Internet Archive (www.archive.org), ~300 TB
5 million terabytes created in 2002
Data Growth Rate
Machine Learning / Data Mining Application areas
- Science
  - astronomy, bioinformatics, drug discovery, …
- Business
  - advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-Commerce, targeted marketing, health care, …
- Web
  - search engines, bots, …
- Government
  - profiling tax cheaters, anti-terror(?)
Data Mining for Customer Modeling
- Customer Tasks:
  - attrition prediction
  - targeted marketing:
    - cross-sell, customer acquisition
  - credit-risk
  - fraud detection
- Industries
  - banking, telecom, retail sales, …
Customer Attrition: Case Study
Customer Attrition Results
Assessing Credit Risk: Case Study
Credit Risk - Results
Successful e-commerce – Case Study
Genomic Microarrays – Case Study
Example: ALL/AML data
Problems Suitable for Data-Mining
Information is crucial
Data mining
- Extracting implicit, previously unknown, potentially useful information from data
- Needed: programs that detect patterns and regularities in the data
- Strong patterns ⇒ good predictions
  - Problem 1: most patterns are not interesting
  - Problem 2: patterns may be inexact (or spurious)
  - Problem 3: data may be garbled or missing
What is Data Mining?
- Many Definitions
  - Non-trivial extraction of implicit, previously unknown and potentially useful information from data
  - Exploration & analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks
- Prediction Methods
  - Use some variables to predict unknown or future values of other variables.
- Description Methods
  - Find human-interpretable patterns that describe the data.
Major Data Mining Tasks
- Classification: predicting an item class
- Clustering: finding clusters in data
- Associations: e.g. A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Deviation Detection: finding changes
- Estimation: predicting a continuous value
- Link Analysis: finding relationships
- …
Data Mining Tasks: Classification
Learn a method for predicting the instance class from pre-labeled (classified) instances
Many approaches: Statistics, Decision Trees, Neural Networks, ...
Classification: Definition
- Given a collection of records (training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
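The train/test procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the records are made-up toy data, and a 1-nearest-neighbour classifier is swapped in as the simplest possible model (the slides do not prescribe it). The names `split`, `predict`, and `accuracy` are hypothetical helpers.

```python
import math
import random

# Hypothetical toy records: (attributes, class). Not data from the lecture.
records = [
    ((1.0, 1.2), "A"), ((0.8, 0.9), "A"), ((1.1, 0.7), "A"),
    ((1.3, 1.0), "A"), ((0.9, 1.4), "A"),
    ((5.0, 5.1), "B"), ((4.8, 5.3), "B"), ((5.2, 4.7), "B"),
    ((5.5, 5.0), "B"), ((4.9, 4.6), "B"),
]


def split(records, n_test, seed=0):
    """Shuffle the records and hold out n_test of them as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


def predict(training_set, attributes):
    """1-nearest-neighbour: return the class of the closest training record."""
    nearest = min(training_set, key=lambda rec: math.dist(rec[0], attributes))
    return nearest[1]


def accuracy(training_set, test_set):
    """Fraction of held-out records whose class is predicted correctly."""
    correct = sum(predict(training_set, x) == y for x, y in test_set)
    return correct / len(test_set)


training_set, test_set = split(records, n_test=3)
test_accuracy = accuracy(training_set, test_set)
print(f"accuracy on {len(test_set)} unseen records: {test_accuracy:.2f}")
```

The key point is that `accuracy` only ever sees records that were not used to build the model, which is what makes it an estimate for previously unseen data.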
Classification Example
[Figure: a training set of labeled records is used to learn a classifier; the resulting model is applied to a test set]
Data Mining Tasks: Clustering
Find “natural” grouping of
instances given un-labeled data
Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.
- Similarity Measures:
  - Euclidean Distance if attributes are continuous.
  - Other Problem-specific Measures.
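A minimal sketch of clustering with Euclidean distance follows. It uses k-means, which is one common algorithm and is swapped in here for illustration (the slides do not name a specific method). The points and the starting centroids are made-up assumptions.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid
    (Euclidean distance), then move each centroid to the mean of its points."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]      # keep an empty cluster's centroid
            for i, cluster in enumerate(clusters)
        ]
    return clusters, centroids


# Hypothetical 3-D points forming two well-separated groups.
points = [(1, 1, 1), (1, 2, 1), (2, 1, 1), (8, 8, 8), (8, 9, 8), (9, 8, 8)]
clusters, centroids = kmeans(points, centroids=[points[0], points[3]])
for cluster, centroid in zip(clusters, centroids):
    print(centroid, cluster)
```

Points that end up in the same cluster are closer to their own centroid than to the other one, which is the "more similar within, less similar across" criterion in distance form.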
Illustrating Clustering
- Euclidean Distance Based Clustering in 3-D space.
Machine learning techniques
Related Fields
[Figure: Data Mining and Knowledge Discovery shown together with its related fields: Machine Learning, Visualization, Statistics, Databases]
Structural descriptions
Association Rule Discovery: Application 2
- Supermarket shelf management.
  - Goal: To identify items that are bought together by sufficiently many customers.
  - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
    - So, don't be surprised if you find six-packs stacked next to diapers!
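"Bought together by sufficiently many customers" and "very likely" are usually made precise with two standard measures, support and confidence (terms not used on the slide). The sketch below computes both for the diaper rule on a handful of made-up transactions; the data and function names are assumptions for illustration.

```python
# Hypothetical point-of-sale transactions (one set of items per basket).
transactions = [
    {"diaper", "milk", "beer"},
    {"diaper", "milk", "beer", "bread"},
    {"diaper", "milk"},
    {"bread", "milk"},
    {"diaper", "beer"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Of the baskets containing the antecedent, the fraction that also
    contain the consequent."""
    return (support(antecedent | consequent, transactions)
            / support(antecedent, transactions))


rule_support = support({"diaper", "milk", "beer"}, transactions)
rule_confidence = confidence({"diaper", "milk"}, {"beer"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

A rule is typically reported only if both values exceed user-chosen thresholds, which is one way to filter out the many uninteresting patterns.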
Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.
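For the linear case with a single input variable, the model can be fitted by ordinary least squares in closed form. The sketch below assumes made-up advertising and sales figures; `fit_line` is a hypothetical helper, not a library function.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one input)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure vs. sales (arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept   # prediction for a new expenditure
print(f"sales = {slope:.2f} * advertising + {intercept:.2f}")
print(f"predicted sales at advertising = 5: {predicted_sales:.2f}")
```

Unlike classification, the predicted quantity here is a continuous value, so quality is judged by how far predictions fall from the true values rather than by counting correct classes.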
Ross Quinlan
Classification vs. association rules
- Classification rule: predicts the value of a given attribute (the classification of an example)
    If outlook = sunny and humidity = high
    then play = no
- Association rule: predicts the value of an arbitrary attribute (or combination)
    If temperature = cool then humidity = normal
    If humidity = normal and windy = false
    then play = yes
    If outlook = sunny and play = no
    then humidity = high
    If windy = false and play = no
    then outlook = sunny and humidity = high
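Both kinds of rule have the same "if ... then ..." shape, so they can be checked against data the same way; the difference is only in which attributes may appear after "then". The sketch below counts, for each rule, how many rows it covers and how many of those it gets right. The rows are a few assumed examples in the style of the weather data, not the lecture's full table, and `rule_stats` is a hypothetical helper.

```python
# A few illustrative rows in the style of the weather data (assumed values).
rows = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False, "play": "no"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": True, "play": "no"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False, "play": "yes"},
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "sunny", "temperature": "cool", "humidity": "normal", "windy": False, "play": "yes"},
    {"outlook": "rainy", "temperature": "cool", "humidity": "normal", "windy": True, "play": "no"},
]


def matches(row, conditions):
    """True if the row has the required value for every listed attribute."""
    return all(row[attr] == value for attr, value in conditions.items())


def rule_stats(rows, antecedent, consequent):
    """Return (covered, correct): rows matching the 'if' part, and how many
    of those also match the 'then' part."""
    covered = [r for r in rows if matches(r, antecedent)]
    correct = [r for r in covered if matches(r, consequent)]
    return len(covered), len(correct)


# Classification rule: the 'then' part is always the class attribute (play).
classification_rule = ({"outlook": "sunny", "humidity": "high"}, {"play": "no"})
# Association rule: the 'then' part may be any attribute or combination.
association_rule = ({"temperature": "cool"}, {"humidity": "normal"})

for name, (antecedent, consequent) in [("classification", classification_rule),
                                       ("association", association_rule)]:
    covered, correct = rule_stats(rows, antecedent, consequent)
    print(f"{name} rule: covers {covered} rows, correct on {correct}")
```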
Weather data with mixed attributes
A decision tree for this problem
Classifying iris flowers
Attribute                        Type                       1     2     3     …  40
Duration                         (Number of years)          1     2     3        2
Wage increase first year         Percentage                 2%    4%    4.3%     4.5
Wage increase second year        Percentage                 ?     5%    4.4%     4.0
Wage increase third year         Percentage                 ?     ?     ?        ?
Cost of living adjustment        {none,tcf,tc}              none  tcf   ?        none
Working hours per week           (Number of hours)          28    35    38       40
Pension                          {none,ret-allw,empl-cntr}  none  ?     ?        ?
Standby pay                      Percentage                 ?     13%   ?        ?
Shift-work supplement            Percentage                 ?     5%    4%       4
Education allowance              {yes,no}                   yes   ?     ?        ?
Statutory holidays               (Number of days)           11    15    12       12
Vacation                         {below-avg,avg,gen}        avg   gen   gen      avg
Long-term disability assistance  {yes,no}                   no    ?     ?        yes
Dental plan contribution         {none,half,full}           none  ?     full     full
Bereavement assistance           {yes,no}                   no    ?     ?        yes
Health plan contribution         {none,half,full}           none  ?     full     half
Acceptability of contract        {good,bad}                 bad   good  good     good
Decision trees for the labor data
Soybean classification
The role of domain knowledge
Enter machine learning
Enter machine learning
- Extract dark regions from normalized image
- Attributes:
  - size of region
  - shape, area
  - intensity
  - sharpness and jaggedness of boundaries
  - proximity of other regions
  - info about background
- Constraints:
  - Few training examples: oil slicks are rare!
  - Unbalanced data: most dark regions aren't slicks
  - Regions from same image form a batch
  - Requirement: adjustable false-alarm rate
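One common way to make the false-alarm rate adjustable is to have the classifier output a score per region and move the decision threshold. The sketch below assumes such scores exist (higher means "more slick-like"); the scores and function names are made up for illustration and are not the system described in the lecture.

```python
def threshold_for_false_alarm_rate(negative_scores, max_rate):
    """Pick a threshold so that at most max_rate of the known non-slick
    regions score strictly above it (and would raise a false alarm)."""
    ranked = sorted(negative_scores, reverse=True)
    allowed = int(max_rate * len(ranked))     # false alarms we tolerate
    if allowed >= len(ranked):
        return float("-inf")                  # every region may be flagged
    return ranked[allowed]


def false_alarm_rate(negative_scores, threshold):
    """Fraction of non-slick regions flagged at the given threshold."""
    flagged = sum(1 for s in negative_scores if s > threshold)
    return flagged / len(negative_scores)


# Hypothetical classifier scores for dark regions known NOT to be slicks.
non_slick_scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.40, 0.55, 0.70, 0.90]

threshold = threshold_for_false_alarm_rate(non_slick_scores, max_rate=0.2)
observed_rate = false_alarm_rate(non_slick_scores, threshold)
print(f"flag regions scoring above {threshold}: false-alarm rate {observed_rate:.0%}")
```

Lowering `max_rate` raises the threshold, trading missed slicks for fewer false alarms; with so few positive examples, that trade-off is best left to the operator.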
Load forecasting
- Electricity supply companies need forecasts of future demand for power
- Forecasts of min/max load for each hour ⇒ significant savings
- Given: manually constructed load model that assumes "normal" climatic conditions
- Problem: adjust for weather conditions
- Static model consists of:
  - base load for the year
  - load periodicity over the year
  - effect of holidays
Enter machine learning
Diagnosis of machine faults
- Diagnosis: classical domain of expert systems
- Given: Fourier analysis of vibrations measured at various points of a device's mounting
- Question: which fault is present?
- Preventative maintenance of electromechanical motors and generators
- Information very noisy
- So far: diagnosis by expert/hand-crafted rules
Enter machine learning
Marketing and sales I
Marketing and sales II
Machine learning and statistics
Statisticians
- Sir Ronald Aylmer Fisher
  - Born: 17 Feb 1890, London, England
  - Died: 29 July 1962, Adelaide, Australia
  - Numerous distinguished contributions to developing the theory and application of statistics for making quantitative a vast field of biology
- Leo Breiman
  - Developed decision trees
  - 1984: Classification and Regression Trees. Wadsworth.
Data mining and ethics
Data mining and ethics II
- Important questions:
  - Who is permitted access to the data?
  - For what purpose was the data collected?
  - What kind of conclusions can be legitimately drawn from it?
- Caveats must be attached to results
- Purely statistical arguments are never sufficient!
- Are resources put to good use?