Lecture 2

Data Mining

What’s it all about?


- Data vs. information
- Data mining and machine learning
- Structural descriptions
  - Rules: classification and association
  - Decision trees
- Data mining and ethics
Historical Note: Many Names of Data Mining

- Data Fishing, Data Dredging (1960-)
  - used by statisticians (as a pejorative term)
- Data Mining (1990-)
  - used by the database and business communities
  - in 2003, acquired a bad image because of TIA (the Total Information Awareness program)
- Knowledge Discovery in Databases (1989-)
  - used by the AI and machine learning communities
- also: Data Archaeology, Information Harvesting, Information Discovery, Knowledge Extraction, ...

Currently, "Data Mining" and "Knowledge Discovery" are used interchangeably.
Data vs. information

- Society produces huge amounts of data
  - Sources: business, science, medicine, economics, geography, environment, sports, ...
- Potentially a valuable resource
- Raw data is useless: we need techniques to extract information from it automatically
  - Data: recorded facts
  - Information: patterns underlying the data
Trends leading to Data Flood
- More data is generated:
  - bank, telecom, and other business transactions
  - scientific data: astronomy, biology, etc.
  - web, text, and e-commerce
Big Data Examples
- Europe's Very Long Baseline Interferometry (VLBI) network has 16 telescopes, each of which produces 1 Gigabit/second of astronomical data over a 25-day observation session
  - storage and analysis are a big problem
- AT&T handles billions of calls per day
  - so much data that it cannot all be stored; analysis has to be done "on the fly", on streaming data
Largest databases in 2003
- Commercial databases:
  - Winter Corp. 2003 survey: France Telecom has the largest decision-support DB, ~30 TB; AT&T ~26 TB
- Web:
  - Alexa internet archive: 7 years of data, 500 TB
  - Google searches 4+ billion pages, many hundreds of TB
  - IBM WebFountain: 160 TB (2003)
  - Internet Archive (www.archive.org): ~300 TB
5 million terabytes created in 2002

- UC Berkeley 2003 estimate: 5 exabytes (5 million terabytes) of new data were created in 2002.
  www.sims.berkeley.edu/research/projects/how-much-info-2003/
- The US produces ~40% of new stored data worldwide
Data Growth Rate

- Twice as much information was created in 2002 as in 1999 (~30% annual growth rate)
- Other growth rate estimates are even higher
- Very little of this data will ever be looked at by a human
- Knowledge Discovery is NEEDED to make sense of, and use, the data
Machine Learning / Data Mining Application Areas

- Science
  - astronomy, bioinformatics, drug discovery, ...
- Business
  - advertising, CRM (Customer Relationship Management), investments, manufacturing, sports/entertainment, telecom, e-commerce, targeted marketing, health care, ...
- Web
  - search engines, bots, ...
- Government
  - profiling tax cheaters, anti-terror(?)
Data Mining for Customer Modeling

- Customer-related tasks:
  - attrition prediction
  - targeted marketing: cross-sell, customer acquisition
  - credit risk
  - fraud detection
- Industries: banking, telecom, retail sales, ...
Customer Attrition: Case Study

- Situation: the attrition rate for mobile phone customers is around 25-30% a year!
- Task: given customer information for the past N months, predict who is likely to attrite next month.
- Also, estimate customer value and determine the most cost-effective offer to be made to each such customer.
Customer Attrition Results

- Verizon Wireless built a customer data warehouse
- Identified potential attriters
- Developed multiple, regional models
- Targeted customers with a high propensity to accept the offer
- Reduced the attrition rate from over 2%/month to under 1.5%/month (a huge impact, with >30 million subscribers)
  (Reported in 2003)
Assessing Credit Risk: Case Study

- Situation: a person applies for a loan
- Task: should the bank approve the loan?
- Note: people with the best credit don't need loans, and people with the worst credit are not likely to repay. The bank's best customers are in the middle.
Credit Risk - Results

- Banks develop credit models using a variety of machine learning methods
- The proliferation of mortgages and credit cards is a result of being able to successfully predict whether a person is likely to default on a loan
- Widely deployed in many countries
Successful e-commerce – Case Study

- A person buys a book (product) at Amazon.com
- Task: recommend other books (products) this person is likely to buy
- Amazon does clustering based on books bought:
  - customers who bought "Advances in Knowledge Discovery and Data Mining" also bought "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations"
- The recommendation program is quite successful
Genomic Microarrays – Case Study

Given microarray data for a number of samples (patients), can we
- accurately diagnose the disease?
- predict the outcome for a given treatment?
- recommend the best treatment?
Example: ALL/AML data

- 38 training cases, 34 test cases, ~7,000 genes
- 2 classes: Acute Lymphoblastic Leukemia (ALL) vs. Acute Myeloid Leukemia (AML)
- Use the training data to build a diagnostic model
- Results on test data: 33/34 correct; the one error may be a mislabeled case
Security and Fraud Detection: Case Study

- Credit card fraud detection
- Detection of money laundering
  - FAIS (US Treasury)
- Securities fraud
  - NASDAQ KDD system
- Phone fraud
  - AT&T, Bell Atlantic, British Telecom/MCI
- Bio-terrorism detection at the 2002 Salt Lake City Olympics
Problems Suitable for Data Mining

- require knowledge-based decisions
- have a changing environment
- have sub-optimal current methods
- have accessible, sufficient, and relevant data
- provide a high payoff for the right decisions!

Privacy considerations are important if personal data is involved.
Information is crucial

- Example 1: in vitro fertilization
  - Given: embryos described by 60 features
  - Problem: selection of embryos that will survive
  - Data: historical records of embryos and outcomes
- Example 2: cow culling
  - Given: cows described by 700 features
  - Problem: selection of cows that should be culled
  - Data: historical records and farmers' decisions
Data mining

- Extracting implicit, previously unknown, and potentially useful information from data
- Needed: programs that detect patterns and regularities in the data
- Strong patterns ⇒ good predictions
  - Problem 1: most patterns are not interesting
  - Problem 2: patterns may be inexact (or spurious)
  - Problem 3: data may be garbled or missing
What is Data Mining?
- Many definitions:
  - Non-trivial extraction of implicit, previously unknown, and potentially useful information from data
  - Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns
Data Mining Tasks

- Prediction methods
  - Use some variables to predict unknown or future values of other variables.
- Description methods
  - Find human-interpretable patterns that describe the data.
Major Data Mining Tasks
- Classification: predicting an item's class
- Clustering: finding clusters in data
- Associations: e.g., A & B & C occur frequently
- Visualization: to facilitate human discovery
- Summarization: describing a group
- Deviation detection: finding changes
- Estimation: predicting a continuous value
- Link analysis: finding relationships
- ...
Data Mining Tasks: Classification

Learn a method for predicting the instance class from pre-labeled (classified) instances.

Many approaches: statistics, decision trees, neural networks, ...
Classification: Definition
- Given a collection of records (the training set)
  - Each record contains a set of attributes; one of the attributes is the class.
- Find a model that predicts the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.
Classification Example

Training set:

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test set (class unknown):

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

A model (classifier) is learned from the training set and then applied to the test set.
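As a minimal sketch of this workflow (my own illustration, not from the lecture; it assumes pandas and scikit-learn, and the column names are shorthand for the table above), a decision tree can be learned from the training set and applied to the test set:

    # A minimal sketch: train a classifier on the table above, then label the test set.
    # Assumes pandas and scikit-learn are installed; not part of the original lecture.
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    train = pd.DataFrame({
        "Refund":  ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"],
        "Marital": ["Single", "Married", "Single", "Married", "Divorced",
                    "Married", "Divorced", "Single", "Married", "Single"],
        "Income":  [125, 100, 70, 120, 95, 60, 220, 85, 75, 90],  # in thousands
        "Cheat":   ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"],
    })
    test = pd.DataFrame({
        "Refund":  ["No", "Yes", "No", "Yes", "No", "No"],
        "Marital": ["Single", "Married", "Married", "Divorced", "Single", "Married"],
        "Income":  [75, 50, 150, 90, 40, 80],
    })

    # One-hot encode the categorical attributes so the tree learner can use them.
    X_train = pd.get_dummies(train[["Refund", "Marital", "Income"]])
    X_test = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

    model = DecisionTreeClassifier().fit(X_train, train["Cheat"])
    print(model.predict(X_test))  # predicted class for each test record

With only ten training records the learned tree is of course a toy; the point is the train-then-predict workflow.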
Data Mining Tasks: Clustering
Find a “natural” grouping of instances, given unlabeled data.
Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - data points in one cluster are more similar to one another, and
  - data points in separate clusters are less similar to one another.
- Similarity measures:
  - Euclidean distance, if the attributes are continuous
  - other, problem-specific measures
Illustrating Clustering
- Euclidean-distance-based clustering in 3-D space: intracluster distances are minimized, intercluster distances are maximized
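As a hedged illustration of Euclidean-distance clustering (my own sketch, assuming NumPy and scikit-learn; the three blobs of 3-D points are synthetic), k-means looks like this:

    # A minimal sketch of Euclidean-distance clustering in 3-D (assumes scikit-learn).
    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(0)
    # Three synthetic blobs of 3-D points around different centers.
    points = np.vstack([
        rng.normal(loc=c, scale=0.5, size=(50, 3))
        for c in ([0, 0, 0], [5, 5, 0], [0, 5, 5])
    ])

    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)
    print(km.cluster_centers_)   # one center per cluster
    print(km.labels_[:10])       # cluster assignment of the first 10 points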
Machine learning techniques

- Algorithms for acquiring structural descriptions from examples
- Structural descriptions represent patterns explicitly
  - can be used to predict the outcome in a new situation
  - can be used to understand and explain how a prediction is derived (may be even more important)
- Methods originate from artificial intelligence, statistics, and research on databases
Related Fields

Data Mining and Knowledge Discovery lies at the intersection of several related fields: Machine Learning, Statistics, Databases, and Visualization.
Structural descriptions

- Example: if-then rules

If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft

Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            No           Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
...             ...                     ...          ...                   ...
The weather problem

- Conditions for playing a certain game

Outlook   Temperature  Humidity  Windy  Play
Sunny     Hot          High      False  No
Sunny     Hot          High      True   No
Overcast  Hot          High      False  Yes
Rainy     Mild         Normal    False  Yes
...       ...          ...       ...    ...

If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
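Such rules are directly executable. A minimal sketch (my own illustration) of the rule set above, applied in order with the first match winning:

    # A sketch of the weather rules above as an ordered, first-match rule list.
    def play(outlook: str, humidity: str, windy: bool) -> str:
        if outlook == "sunny" and humidity == "high":
            return "no"
        if outlook == "rainy" and windy:
            return "no"
        if outlook == "overcast":
            return "yes"
        if humidity == "normal":
            return "yes"
        return "yes"  # default rule: "if none of the above then play = yes"

    print(play("sunny", "high", False))    # -> no
    print(play("overcast", "high", True))  # -> yes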
Association Rule Discovery: Definition

- Given a set of records, each of which contains some number of items from a given collection,
  - produce dependency rules that will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
{Milk} --> {Coke}
{Diaper, Milk} --> {Beer}
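As a hedged sketch (my own code, not part of the lecture), the support and confidence of the discovered rules can be computed directly from the five transactions above:

    # A minimal sketch: compute support and confidence for a rule X --> Y
    # over the five transactions shown above.
    transactions = [
        {"Bread", "Coke", "Milk"},
        {"Beer", "Bread"},
        {"Beer", "Coke", "Diaper", "Milk"},
        {"Beer", "Bread", "Diaper", "Milk"},
        {"Coke", "Diaper", "Milk"},
    ]

    def support(itemset: set) -> float:
        """Fraction of transactions containing every item in `itemset`."""
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs: set, rhs: set) -> float:
        """Support of the union divided by support of the left-hand side."""
        return support(lhs | rhs) / support(lhs)

    print(confidence({"Milk"}, {"Coke"}))            # 3/4 = 0.75
    print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2/3 ≈ 0.67

Here {Milk} --> {Coke} holds in 3 of the 4 milk-containing transactions, and {Diaper, Milk} --> {Beer} in 2 of 3.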
Association Rule Discovery: Application 1

- Marketing and sales promotion:
  - Let the rule discovered be {Bagels, ...} --> {Potato Chips}
  - Potato Chips as consequent: can be used to determine what should be done to boost its sales.
  - Bagels in the antecedent: can be used to see which products would be affected if the store discontinues selling bagels.
  - Bagels in the antecedent and Potato Chips in the consequent: can be used to see what products should be sold with bagels to promote the sale of Potato Chips!
Association Rule Discovery: Application 2

- Supermarket shelf management
  - Goal: identify items that are bought together by sufficiently many customers.
  - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
    - So don't be surprised if you find six-packs stacked next to diapers!
Regression
- Predict the value of a given continuous variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in statistics and in the neural network field.
- Examples:
  - predicting sales of a new product based on advertising expenditure
  - predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - time series prediction of stock market indices
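As a hedged illustration of the first example (my own sketch with made-up numbers, assuming NumPy), a linear model sales ≈ a + b · advertising can be fitted by least squares:

    # A minimal least-squares sketch on made-up advertising/sales data (assumes NumPy).
    import numpy as np

    advertising = np.array([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical spend
    sales = np.array([2.1, 3.9, 6.2, 8.0, 9.8])        # hypothetical sales

    # Design matrix with an intercept column: sales ≈ a + b * advertising.
    X = np.column_stack([np.ones_like(advertising), advertising])
    (a, b), *_ = np.linalg.lstsq(X, sales, rcond=None)
    print(f"sales ≈ {a:.2f} + {b:.2f} * advertising")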
Ross Quinlan

- Machine learning researcher since the 1970s
- University of Sydney, Australia
- 1986: "Induction of Decision Trees", Machine Learning journal
- 1993: C4.5: Programs for Machine Learning, Morgan Kaufmann
Classification vs. association rules

- Classification rule: predicts the value of a given attribute (the classification of an example)

  If outlook = sunny and humidity = high then play = no

- Association rule: predicts the value of an arbitrary attribute (or combination of attributes)

  If temperature = cool then humidity = normal
  If humidity = normal and windy = false then play = yes
  If outlook = sunny and play = no then humidity = high
  If windy = false and play = no then outlook = sunny and humidity = high
Weather data with mixed attributes

- Some attributes have numeric values

Outlook   Temperature  Humidity  Windy  Play
Sunny     85           85        False  No
Sunny     80           90        True   No
Overcast  83           86        False  Yes
Rainy     75           80        False  Yes
...       ...          ...       ...    ...

If outlook = sunny and humidity > 83 then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity < 85 then play = yes
If none of the above then play = yes
The contact lenses data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
A complete and correct rule set

If tear production rate = reduced then recommendation = none
If age = young and astigmatic = no
   and tear production rate = normal then recommendation = soft
If age = pre-presbyopic and astigmatic = no
   and tear production rate = normal then recommendation = soft
If age = presbyopic and spectacle prescription = myope
   and astigmatic = no then recommendation = none
If spectacle prescription = hypermetrope and astigmatic = no
   and tear production rate = normal then recommendation = soft
If spectacle prescription = myope and astigmatic = yes
   and tear production rate = normal then recommendation = hard
If age = young and astigmatic = yes
   and tear production rate = normal then recommendation = hard
If age = pre-presbyopic and spectacle prescription = hypermetrope
   and astigmatic = yes then recommendation = none
If age = presbyopic and spectacle prescription = hypermetrope
   and astigmatic = yes then recommendation = none
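A small sketch (my own illustration, not from the lecture) of how such a rule set can be represented as data and applied with first-match semantics; only the first few rules are shown:

    # A sketch: the first few contact-lens rules as (conditions, recommendation)
    # pairs, applied in order with first-match-wins semantics.
    RULES = [
        ({"tear_rate": "reduced"}, "none"),
        ({"age": "young", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
        ({"age": "pre-presbyopic", "astigmatic": "no", "tear_rate": "normal"}, "soft"),
        ({"prescription": "myope", "astigmatic": "yes", "tear_rate": "normal"}, "hard"),
        # ... the remaining rules from the slide would follow the same pattern
    ]

    def recommend(instance: dict) -> str:
        for conditions, recommendation in RULES:
            if all(instance.get(k) == v for k, v in conditions.items()):
                return recommendation
        return "none"  # fall-through default

    print(recommend({"age": "young", "astigmatic": "no", "tear_rate": "normal"}))  # soft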
A decision tree for this problem

(decision tree figure not reproduced in this text version)
Classifying iris flowers

     Sepal length  Sepal width  Petal length  Petal width  Type
1    5.1           3.5          1.4           0.2          Iris setosa
2    4.9           3.0          1.4           0.2          Iris setosa
...
51   7.0           3.2          4.7           1.4          Iris versicolor
52   6.4           3.2          4.5           1.5          Iris versicolor
...
101  6.3           3.3          6.0           2.5          Iris virginica
102  5.8           2.7          5.1           1.9          Iris virginica
...

If petal length < 2.45 then Iris setosa
If sepal width < 2.10 then Iris versicolor
...
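As a hedged sketch (my own code; scikit-learn happens to ship this exact dataset), a decision tree learner recovers thresholds very much like the rules above:

    # A sketch: learn a decision tree on the iris data (assumes scikit-learn).
    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0)
    tree.fit(iris.data, iris.target)

    # Print the learned rules; the root split separates setosa by a petal
    # measurement, much like the "petal length < 2.45" rule above.
    print(export_text(tree, feature_names=list(iris.feature_names)))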
Predicting CPU performance

- Example: 209 different computer configurations

     Cycle time  Main memory (Kb)  Cache  Channels      Performance
     (ns)        min      max      (Kb)   min    max
     MYCT        MMIN     MMAX     CACH   CHMIN  CHMAX  PRP
1    125         256      6000     256    16     128    198
2    29          8000     32000    32     8      32     269
...
208  480         512      8000     32     0      0      67
209  480         1000     4000     0      0      0      45

- Linear regression function:

PRP = -55.9 + 0.0489 MYCT + 0.0153 MMIN + 0.0056 MMAX
      + 0.6410 CACH - 0.2700 CHMIN + 1.480 CHMAX
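As a small sketch (my own code; the coefficients are exactly those given above), the fitted function can be evaluated directly:

    # A sketch: evaluate the linear regression function given above.
    def prp(myct, mmin, mmax, cach, chmin, chmax):
        """Predicted relative performance from the fitted coefficients."""
        return (-55.9 + 0.0489 * myct + 0.0153 * mmin + 0.0056 * mmax
                + 0.6410 * cach - 0.2700 * chmin + 1.480 * chmax)

    # Predicted PRP for configuration 1 from the table.
    print(round(prp(125, 256, 6000, 256, 16, 128), 1))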
Data from labor negotiations

Attribute                        Type                         1     2     3     ...  40
Duration                         (number of years)            1     2     3          2
Wage increase first year         percentage                   2%    4%    4.3%       4.5
Wage increase second year        percentage                   ?     5%    4.4%       4.0
Wage increase third year         percentage                   ?     ?     ?          ?
Cost of living adjustment        {none, tcf, tc}              none  tcf   ?          none
Working hours per week           (number of hours)            28    35    38         40
Pension                          {none, ret-allw, empl-cntr}  none  ?     ?          ?
Standby pay                      percentage                   ?     13%   ?          ?
Shift-work supplement            percentage                   ?     5%    4%         4
Education allowance              {yes, no}                    yes   ?     ?          ?
Statutory holidays               (number of days)             11    15    12         12
Vacation                         {below-avg, avg, gen}        avg   gen   gen        avg
Long-term disability assistance  {yes, no}                    no    ?     ?          yes
Dental plan contribution         {none, half, full}           none  ?     full       full
Bereavement assistance           {yes, no}                    no    ?     ?          yes
Health plan contribution         {none, half, full}           none  ?     full       half
Acceptability of contract        {good, bad}                  bad   good  good       good
Decision trees for the labor data

(decision tree figures not reproduced in this text version)
Soybean classification

Group        Attribute                Number of values  Sample value
Environment  Time of occurrence       7                 July
             Precipitation            3                 Above normal
Seed         Condition                2                 Normal
             Mold growth              2                 Absent
Fruit        Condition of fruit pods  4                 Normal
             Fruit spots              5                 ?
Leaves       Condition                2                 Abnormal
             Leaf spot size           3                 ?
Stem         Condition                2                 Abnormal
             Stem lodging             2                 Yes
Roots        Condition                3                 Normal
Diagnosis                             19                Diaporthe stem canker
The role of domain knowledge

If leaf condition is normal
   and stem condition is abnormal
   and stem cankers is below soil line
   and canker lesion color is brown
then diagnosis is rhizoctonia root rot

If leaf malformation is absent
   and stem condition is abnormal
   and stem cankers is below soil line
   and canker lesion color is brown
then diagnosis is rhizoctonia root rot

But in this domain, "leaf condition is normal" implies "leaf malformation is absent", so the two rules are equivalent!
Fielded applications
- The result of learning (or the learning method itself) is deployed in practical applications:
  - processing loan applications
  - screening images for oil slicks
  - electricity supply forecasting
  - diagnosis of machine faults
  - marketing and sales
  - reducing banding in rotogravure printing
  - autoclave layout for aircraft parts
  - automatic classification of sky objects
  - automated completion of repetitive forms
  - text retrieval
Processing loan applications (American Express)

- Given: a questionnaire with financial and personal information
- Question: should money be lent?
- A simple statistical method covers 90% of cases
- Borderline cases are referred to loan officers
- But: 50% of accepted borderline cases defaulted!
- Solution: reject all borderline cases?
  - No! Borderline cases are the most active customers
Enter machine learning

- 1000 training examples of borderline cases
- 20 attributes:
  - age
  - years with current employer
  - years at current address
  - years with the bank
  - other credit cards possessed, ...
- Learned rules were correct on 70% of cases
  - human experts only 50%
- The rules could be used to explain decisions to customers
Screening images

- Given: radar satellite images of coastal waters
- Problem: detect oil slicks in those images
- Oil slicks appear as dark regions with changing size and shape
- Not easy: lookalike dark regions can be caused by weather conditions (e.g., high wind)
- An expensive process requiring highly trained personnel
Enter machine learning
- Extract dark regions from the normalized image
- Attributes:
  - size of region
  - shape, area
  - intensity
  - sharpness and jaggedness of boundaries
  - proximity of other regions
  - info about the background
- Constraints:
  - few training examples: oil slicks are rare!
  - unbalanced data: most dark regions aren't slicks
  - regions from the same image form a batch
  - requirement: an adjustable false-alarm rate
Load forecasting
- Electricity supply companies need forecasts of future demand for power
- Forecasts of min/max load for each hour ⇒ significant savings
- Given: a manually constructed load model that assumes "normal" climatic conditions
- Problem: adjust for weather conditions
- The static model consists of:
  - the base load for the year
  - load periodicity over the year
  - the effect of holidays
Enter machine learning

- The prediction is corrected using the "most similar" days
- Attributes:
  - temperature
  - humidity
  - wind speed
  - cloud cover readings
  - plus the difference between the actual and predicted load
- The average difference among the three "most similar" days is added to the static model
- Linear regression coefficients form the attribute weights in the similarity function
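As a hedged reconstruction of the idea described above (my own sketch; the weights, weather attributes, and load figures are all made up), the "most similar days" correction might look like this:

    # A sketch of the "most similar days" correction described above.
    # All numbers and weights here are made up for illustration.
    import numpy as np

    # Attribute weights (e.g., taken from linear regression coefficients).
    weights = np.array([0.5, 0.2, 0.2, 0.1])

    # Historical days: weather attributes and (actual - predicted) load difference.
    past_weather = np.array([
        [20.0, 0.60, 5.0, 0.3],
        [22.0, 0.55, 3.0, 0.1],
        [18.0, 0.70, 8.0, 0.9],
        [21.0, 0.58, 4.0, 0.2],
        [15.0, 0.80, 10.0, 1.0],
    ])
    past_load_diff = np.array([120.0, 80.0, -40.0, 95.0, -150.0])

    def corrected_load(static_prediction: float, today: np.ndarray, k: int = 3) -> float:
        """Add the average load difference of the k most similar past days."""
        distances = np.sqrt(((past_weather - today) ** 2 * weights).sum(axis=1))
        nearest = np.argsort(distances)[:k]
        return static_prediction + past_load_diff[nearest].mean()

    print(corrected_load(1000.0, np.array([21.0, 0.57, 4.5, 0.2])))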
Diagnosis of machine faults

- Diagnosis: a classical domain of expert systems
- Given: Fourier analysis of vibrations measured at various points of a device's mounting
- Question: which fault is present?
- Preventative maintenance of electromechanical motors and generators
- The information is very noisy
- So far: diagnosis by expert/hand-crafted rules
Enter machine learning

- Available: 600 faults with an expert's diagnosis
- ~300 were unsatisfactory; the rest were used for training
- Attributes were augmented by intermediate concepts that embodied causal domain knowledge
- The expert was not satisfied with the initial rules because they did not relate to his domain knowledge
- Further background knowledge resulted in more complex rules that were satisfactory
- The learned rules outperformed the hand-crafted ones
Marketing and sales I

- Companies precisely record massive amounts of marketing and sales data
- Applications:
  - Customer loyalty: identifying customers that are likely to defect by detecting changes in their behavior (e.g., banks/phone companies)
  - Special offers: identifying profitable customers (e.g., reliable owners of credit cards that need extra money during the holiday season)
Marketing and sales II

- Market basket analysis
  - Association techniques find groups of items that tend to occur together in a transaction (used to analyze checkout data)
- Historical analysis of purchasing patterns
- Identifying prospective customers
  - Focusing promotional mailouts (targeted campaigns are cheaper than mass-marketed ones)
Machine learning and statistics

- Historical difference (grossly oversimplified):
  - Statistics: testing hypotheses
  - Machine learning: finding the right hypothesis
- But: huge overlap
  - decision trees (C4.5 and CART)
  - nearest-neighbor methods
- Today: the perspectives have converged
  - Most ML algorithms employ statistical techniques
Statisticians
- Sir Ronald Aylmer Fisher
  - Born: 17 Feb 1890, London, England; died: 29 July 1962, Adelaide, Australia
  - Numerous distinguished contributions to developing the theory and application of statistics for making quantitative a vast field of biology

- Leo Breiman
  - Developed decision trees
  - 1984: Classification and Regression Trees, Wadsworth
Data mining and ethics

- Ethical issues arise in practical applications
- Data mining can be used to discriminate
  - e.g., loan applications: using some information (e.g., sex, religion, race) is unethical
- The ethical situation depends on the application
  - e.g., the same information may be acceptable in a medical application
- Attributes may contain problematic information
  - e.g., area code may correlate with race
Data mining and ethics II

- Important questions:
  - Who is permitted access to the data?
  - For what purpose was the data collected?
  - What kinds of conclusions can legitimately be drawn from it?
- Caveats must be attached to results
- Purely statistical arguments are never sufficient!
- Are resources put to good use?
