CS822-DataMining-Week1 (1)
CS822 Data Mining
Instructor: Dr. Muhammad Tahir
2
Why Data Mining?
• The rapid growth of Data
• Data collection and availability caused by Digital Transformation and
Automation
• Across all fields
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• As a result, we are drowning in data; data mining helps find
knowledge within data.
• “Necessity is the mother of invention”: data mining is the automated
analysis of massive data sets
3
“We are Drowning in Data...”
• Predict behavior of mankind
4
“We are Drowning in Data...”
• Law enforcement agencies collect unknown amounts of data
from various sources
• Cell phone calls
• Location data
• Web browsing behavior
• Credit card transactions
• Online profiles (Facebook)
• …
• Predict
• Terrorist or not?
• Trustworthiness
5
“...but starving for knowledge!”
• ← Amount of data that is collected
• ← Amount of data that can be looked at by
humans
8
Data Mining methods
• detect interesting patterns in large quantities of data
• Support human decision making by providing such
patterns
• Predict the outcome of a future observation based on
the patterns
9
Knowledge Discovery in Data:
Process
10
Data Mining Process
11
Knowledge Discovery in Data:
Process
• Example:
• Analysis of purchases in a supermarket
12
Knowledge Discovery in Data:
Challenges
13
Introduction to Data
14
Introduction to Data
• Introduction to Data
• Transactional Data
• Temporal Data
• Spatial & Spatial-Temporal Data
• Data Preprocessing
• Missing Values
• Summarization
15
Introduction to Data
16
Data Come from Everywhere
17
What is Data?
• Collection of records and their
attributes
• An attribute is a characteristic of an
object
• A collection of attributes describes an
object
18
Types of Data
• Record Data
• Transactional Data
• Temporal Data
• Time Series Data
• Sequence Data
• Spatial & Spatial-Temporal Data
• Spatial Data
• Graph Data
• World Wide Web
• Molecular Structures
• Unstructured Data
• Twitter Status Message
• Review, news article
• Semi-Structured Data
• Paper Publications Data
20
Transaction Data
• Transaction data is a special type of record data, where
each record (transaction) involves a set of items.
Consider a grocery store.
21
Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi dimensional space, where each
dimension represents a distinct attribute
• Such data set can be represented by an m-by-n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
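As a minimal sketch of the m-by-n layout described above (the values are hypothetical and the helper name `attribute_column` is made up for illustration), a data matrix can be held as a list of rows. Because every row is a point in n-dimensional space, a distance between two objects is well defined:

```python
import math

# Hypothetical data set: m = 3 objects, each with n = 2 numeric attributes.
data_matrix = [
    [1.0, 2.0],  # object 0
    [4.0, 6.0],  # object 1
    [0.0, 0.0],  # object 2
]

m = len(data_matrix)     # rows: one per object
n = len(data_matrix[0])  # columns: one per attribute


def attribute_column(matrix, j):
    """Return the values of attribute j across all objects."""
    return [row[j] for row in matrix]


# Each row is a point in n-dimensional space, so Euclidean distance applies.
distance_0_1 = math.dist(data_matrix[0], data_matrix[1])

print(f"{m}-by-{n} matrix")
print(attribute_column(data_matrix, 1))
print(distance_0_1)
```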
22
Document term matrix
• Each document becomes a “term” vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
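The construction above can be sketched in a few lines. This is an illustration only: the two documents are hypothetical, tokenization is a plain whitespace split, and `document_term_matrix` is a made-up helper name.

```python
from collections import Counter

# Hypothetical mini-corpus.
documents = [
    "data mining finds patterns in data",
    "mining patterns",
]


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in doc i."""
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocabulary = sorted(set().union(*counts))
    # A Counter returns 0 for terms that do not occur in a document.
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


vocabulary, matrix = document_term_matrix(documents)
print(vocabulary)
for row in matrix:
    print(row)
```

Each row is the term vector of one document, so the result is also a data matrix in the sense of the previous slide.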
23
Distance Matrix
24
Temporal Data
• Sequence Data
25
26
Temporal Data
• Time Series Data
27
Interval Data
• EL = {(A, 1, 5), (C, 3, 12), (B, 4, 9), (D, 9, 15)}
28
Spatial & Spatial-Temporal Data
• Spatial Data
29
Spatial & Spatial-Temporal Data
• Spatial Data
30
Spatial & Spatial-Temporal Data
• Spatial Data
31
Spatial & Spatial-Temporal Data
• Trajectory Data: Set of Hurricanes
32
Spatial & Spatial-Temporal Data
• Trajectory Data: (of 87 users obtained using RFID)
33
User Movement Data
• Trajectory
• Movement trail of a user
• Sampling Points: <latitude, longitude, time>
34
Graph-Based Data
• A graph can sometimes be a convenient and powerful
representation for data.
• We consider two specific cases:
1. the graph captures relationships among data objects
2. the data objects themselves are represented as graphs.
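For the first case, where the graph captures relationships among data objects, a common in-memory form is an adjacency list. The sketch below assumes hypothetical web pages A, B, and C connected by directed links; `build_adjacency` is an illustrative helper, not part of any library.

```python
# Hypothetical directed links between web pages: (source, target).
links = [("A", "B"), ("A", "C"), ("B", "C")]


def build_adjacency(edges):
    """Map every node to the set of nodes it links to."""
    adjacency = {}
    for source, target in edges:
        adjacency.setdefault(source, set()).add(target)
        adjacency.setdefault(target, set())  # keep nodes with no outgoing links
    return adjacency


adjacency = build_adjacency(links)
print(adjacency)
```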
35
Graph Data
36
Semi-structured Data
37
Unstructured Data
38
Issues with Data
• Outliers
• Missing Values
• Inconsistent Values
• Duplicate Data
39
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Variable transformation
40
Data can help us solve specific
problems.
41
How should these pictures be placed
into 3 groups?
42
How should these pictures be placed
into groups? How many groups
should there be?
43
Which genes are associated with a disease? How
can expression values be used to predict survival?
44
What items should Amazon display
for me?
45
Is it likely that this stock was traded
based on illegal insider information?
46
Where are the faces in this picture?
47
Is this spam?
48
Will I like 300?
49
What techniques do people apply to
data?
• They apply data mining algorithms and discover useful
knowledge
• So, what are some of the well-known data mining
tasks?
• Clustering,
• Classification,
• Frequent Patterns,
• Association Rules,
• ….
50
What do people do with time series
data?
51
What do people do with trajectory
data?
52
Related Fields
53
Related Fields
• Statistics:
• more theory-based
• more focused on testing hypotheses
• Machine learning
• more heuristic
• focused on improving performance of a learning agent
• also looks at real-time learning and robotics – areas not part of data mining
55
Data Mining Tasks
• Cluster Analysis
• Classification
• Regression
• Association Analysis
56
Cluster Analysis
57
Cluster Analysis: Definition
• Given a set of data points, each having a set of attributes,
and a similarity measure among them, find groups such that
• data points in one group are more similar to one another
• data points in separate groups are less similar to one another
• Similarity Measures
• Euclidean distance if attributes are continuous
• Other task-specific similarity measures
• Goals
1. intra-cluster distances are minimized
2. inter-cluster distances are maximized
• Result
• A descriptive grouping of data points
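One way to pursue the two goals above is k-means, which is not named on the slide and is used here only as a familiar example. The sketch assumes continuous attributes, Euclidean distance, hypothetical 2-D points, and hand-picked initial centroids:

```python
import math


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for point in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: math.dist(point, centroids[i]),
            )
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster
            else centroids[i]  # leave an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical data: two visibly separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print(clusters)
```

The result depends on the initial centroids and on the number of clusters, both of which are assumptions supplied by the caller.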
58
Illustration of Clustering
• Euclidean distance-based clustering in 3-D space.
• Intracluster distances are minimized
• Intercluster distances are maximized
59
Cluster Analysis: Application 1
• Application area: Market segmentation
• Goal: Find groups of similar customers
• where a group may be conceived as a marketing target to be
reached with a distinct marketing mix
• Approach:
1. collect information about customers
2. find clusters of similar customers
3. measure the clustering quality by observing buying patterns
after targeting customers with distinct marketing mixes
60
Cluster Analysis: Application 2
• Application area: Document Clustering
• Goal: Find groups of documents that are similar to each
other based on terms appearing in them
• Approach
1. identify frequently occurring terms in each document
2. form a similarity measure based on the frequencies of
different terms
• Application Example: Grouping of articles in Google
News
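A similarity measure based on term frequencies can take several forms. The sketch below uses cosine similarity, which is one common choice and not necessarily the one intended on the slide; the example documents are hypothetical and tokenization is a plain whitespace split.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


print(cosine_similarity("stock market rises", "stock prices"))
print(cosine_similarity("stock market", "football game"))
```

Documents that share no terms score 0, and identical documents score 1, so the value can be fed directly into a clustering method that expects a similarity.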
61
Cluster Analysis: Document
Clustering
• Clustering Points: 3204 articles from the Los Angeles Times.
• Similarity Measure: How many words are common in
these documents (after some word filtering).
Category    Total Articles    Correctly Placed
Financial   555               364
National    273               36
63
Classification: Definition
• Goal: Previously unseen records should be assigned a
class from a given set of classes as accurately as
possible.
• Approach:
• Given a collection of records (training set)
• each record contains a set of attributes
• one attribute is the class attribute (label) that should be
predicted
• Find a model for predicting the class attribute as a
function of the values of other attributes
64
Classification: Example
• Training set:
66
Classification: Workflow
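[Figure: a training set and a second set with the same attributes (Home Owner, Marital Status, Taxable Income, Default); a classifier is learned from the training set and yields a model. Attribute types: Home Owner (categorical), Marital Status (categorical), Taxable Income (continuous), Default (class).]
Training set rows visible in the figure:
Tid   Home Owner   Marital Status   Taxable Income   Default
7     Yes          Divorced         220K             No
8     No           Single           85K              Yes
9     No           Married          75K              No
10    No           Single           90K              Yes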
67
Decision Tree - Example
[Figure: table with columns Tid, Home Owner (categorical), Marital Status (categorical), Taxable Income (continuous), and Default (class), shown with the label "Splitting Attributes".]
68
Decision Tree - Example
Attribute types: Home Owner (categorical), Marital Status (categorical), Taxable Income (continuous), Default (class)
Training Data
Tid   Home Owner   Marital Status   Taxable Income   Default
1     Yes          Single           125K             No
2     No           Married          100K             No
3     No           Single           70K              No
4     Yes          Married          120K             No
5     No           Divorced         95K              Yes
6     No           Married          60K              No
7     Yes          Divorced         220K             No
8     No           Single           85K              Yes
9     No           Married          75K              No
10    No           Single           90K              Yes
Model: Decision Tree
MarSt
  Married: NO
  Single, Divorced: HO
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES
There could be more than one tree that fits the same data!
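The tree on this slide tests MarSt at the root, then HO, then TaxInc. As a sketch, it can be written as nested conditions and checked against the ten training records. The function name `predict_default` is illustrative, and sending an income of exactly 80K to the YES branch is an assumption, since the slide only labels the branches "< 80K" and "> 80K".

```python
# Training records from the slide:
# (Tid, Home Owner, Marital Status, Taxable Income in thousands, Default)
training_data = [
    (1, "Yes", "Single", 125, "No"),
    (2, "No", "Married", 100, "No"),
    (3, "No", "Single", 70, "No"),
    (4, "Yes", "Married", 120, "No"),
    (5, "No", "Divorced", 95, "Yes"),
    (6, "No", "Married", 60, "No"),
    (7, "Yes", "Divorced", 220, "No"),
    (8, "No", "Single", 85, "Yes"),
    (9, "No", "Married", 75, "No"),
    (10, "No", "Single", 90, "Yes"),
]


def predict_default(home_owner, marital_status, taxable_income_k):
    """Walk the tree: MarSt first, then HO, then TaxInc."""
    if marital_status == "Married":
        return "No"
    if home_owner == "Yes":
        return "No"
    # Assumption: exactly 80K falls on the YES side of the split.
    return "No" if taxable_income_k < 80 else "Yes"


correct = sum(
    predict_default(owner, status, income) == label
    for _, owner, status, income, label in training_data
)
print(f"{correct} of {len(training_data)} training records classified correctly")
```

Fitting all training records says nothing yet about accuracy on unseen records, which is what the goal of classification asks for.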
69
Classification: Application 1
• Application area: Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
1. Use credit card transactions and information about account-
holders as attributes
• When and where does a customer buy? What does he buy?
• How often does he pay on time? etc.
2. Label past transactions as fraud or fair transactions. This
forms the class attribute
3. Learn a model for the class attribute from the transactions
4. Use this model to detect fraud by observing credit card
transactions on an account
70
Classification: Application 2
• Application area: Direct Marketing
• Goal: Reduce the cost of a mailing campaign by targeting only the set of
consumers who are likely to buy a new product
• Approach:
1. Use data from a campaign introducing a similar product in the past
• we know which customers decided to buy and which decided otherwise
• this {buy, don’t buy} decision forms the class attribute
2. Collect various demographic, lifestyle, and company-interaction related
information about the customers
• age, profession, location, income, marriage status, visits, logins, etc.
3. Use this information to learn a classification model
4. Apply model to decide which consumers to target
71
Regression
72
Regression
• Predict a value of a continuous variable based on the values of
other variables, assuming a linear or nonlinear model of
dependency
• Examples:
• Predicting sales amounts of a new product based on advertising expenditure
• Predicting the price of a house or car
• Predicting miles per gallon (MPG) of a car as a function of its weight and
horsepower
• Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
• Difference from classification: The predicted attribute is continuous,
while classification is used to predict nominal attributes (e.g.
yes/no)
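For the linear case with a single input variable, the model can be fitted by ordinary least squares. The sketch below uses made-up advertising and sales figures; `fit_line` is an illustrative helper, not a library function.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: advertising expenditure and resulting sales amounts.
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)
predicted_sales = slope * 5.0 + intercept  # prediction for an unseen expenditure
print(slope, intercept, predicted_sales)
```

The prediction is a number on a continuous scale, which is the difference from classification noted above.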
73
Association Analysis
74
Association Analysis: Definition
• Given a set of records, each of which contains some
number of items from a given collection
• discover frequent item sets and produce association
rules that predict the occurrence of an item based on
occurrences of other items
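Frequent item sets and rules are usually scored with support and confidence. These two measures are standard but are not defined on the slide, so treat the sketch below as an assumption about what is meant. The transactions are hypothetical.

```python
# Hypothetical point-of-sale transactions, one set of items per record.
transactions = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "chips"},
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= record for record in records) / len(records)


def confidence(antecedent, consequent, records):
    """Of the records containing antecedent, the fraction also containing consequent."""
    return support(antecedent | consequent, records) / support(antecedent, records)


rule_support = support({"diapers", "milk", "beer"}, transactions)
rule_confidence = confidence({"diapers", "milk"}, {"beer"}, transactions)
print(rule_support, rule_confidence)
```

An item set counts as frequent when its support reaches a chosen threshold; a rule is kept when its confidence is also high enough. Both thresholds are left to the analyst.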
75
Association Rule Discovery:
Application 1
• Application area: Supermarket shelf management.
• Goal: To identify items that are bought together by sufficiently
many customers
• Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items
• A classic rule and its implications:
• if a customer buys diapers and milk, then he is likely to buy beer as well
• so, don’t be surprised if you find six-packs stacked next to diapers!
• promote diapers to boost beer sales
• if selling diapers is discontinued, this will affect beer sales as well
• Application area: Sales Promotion
76
Association Rule Discovery:
Application 2
• Application area: Inventory Management
• Goal: A consumer appliance repair company wants to
anticipate the nature of repairs on its consumer
products and keep the service vehicles equipped with
the right parts to reduce the number of visits to consumer
households
• Approach: Process the data on tools and parts required
in previous repairs at different consumer locations and
discover the co-occurrence patterns
77
Selection and Exploration
78
Selection and Exploration
• Selection
• What data is potentially useful for the task at hand?
• What data is available?
• What do I know about the quality of the data?
• Exploration / Profiling
• Get an initial understanding of the data
• Calculate basic summarization statistics
• Visualize the data
• Identify data problems such as outliers, missing values,
duplicate records
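A first profiling pass can be done with the standard library. The sketch below uses the taxable incomes (in thousands) from the training table earlier in the deck; the trailing `None` is a hypothetical missing value added for illustration, and flagging values more than two standard deviations from the mean is only one simple rule of thumb for outliers.

```python
import statistics

# Taxable income in thousands; the final None is a hypothetical missing value.
raw_incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90, None]


def profile(values):
    """Basic summary statistics plus simple missing-value and outlier checks."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    stdev = statistics.stdev(present)  # sample standard deviation
    return {
        "count": len(present),
        "missing": len(values) - len(present),
        "min": min(present),
        "max": max(present),
        "mean": mean,
        "median": statistics.median(present),
        "outliers": [v for v in present if abs(v - mean) > 2 * stdev],
    }


summary = profile(raw_incomes)
for name, value in summary.items():
    print(name, value)
```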
79
Visualization & Data Mining
• Visualizing the data to facilitate human discovery
• Presenting the discovered results in a visually "nice"
way
80
Summarization
• Describe features of the selected group
• Use natural language and graphics
• Usually in Combination with Deviation detection or
other methods
81
Data Mining Models and Tasks
82
Preprocessing and Transformation
83
Preprocessing and Transformation
• Transform data into a representation that is suitable for the chosen data
mining methods
• scales of attributes (nominal, ordinal, numeric)
• number of dimensions (represent relevant information using fewer attributes)
• amount of data (determines hardware requirements)
• Methods
• discretization and binarization
• feature subset selection / dimensionality reduction
• attribute transformation / text to term vector / embeddings
• aggregation, sampling
• integrate data from multiple sources
• Good data preparation is key to producing valid and reliable models
• Data integration and preparation are estimated to take 70-80% of the time
and effort of a data mining project
84
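Two of the methods listed above, discretization and binarization, are small enough to sketch directly. The bin boundaries, labels, and category list below are hypothetical choices made for illustration.

```python
def discretize(value, boundaries, labels):
    """Map a numeric value to the label of the first bin whose upper bound exceeds it."""
    for boundary, label in zip(boundaries, labels):
        if value < boundary:
            return label
    return labels[-1]  # values at or above the last boundary


def one_hot(value, categories):
    """Binarize a nominal value into one 0/1 attribute per category."""
    return [1 if value == category else 0 for category in categories]


# Hypothetical bins for taxable income in thousands.
income_boundaries = [80, 120]
income_labels = ["low", "medium", "high"]
marital_categories = ["Single", "Married", "Divorced"]

print(discretize(70, income_boundaries, income_labels))
print(discretize(220, income_boundaries, income_labels))
print(one_hot("Married", marital_categories))
```

Discretization turns a numeric attribute into an ordinal one, and binarization turns a nominal attribute into several numeric ones, so together they let a method that expects one scale work with data recorded on another.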
Data Mining
85
Data Mining
• Input: Preprocessed Data
• Output: Model / Patterns
87
Deployment
• Use model in the business context
• Keep iterating in order to maintain and improve model
[Figure: What data scientists spend the most time doing. Stages shown: Data Sources, Data Preparation, Model Building, Deployment. Taken from "Data Scientist: The Dirtiest Job of the 21st Century" by Jingles (Hong Jing), TDS Archive, Medium.]
89
Data Mining Software
90
RapidMiner
• Powerful data mining suite
• Visual modelling of data mining pipelines
• Commercial tool, offering educational licenses
91
Gartner 2018 Magic Quadrant for Data
Science and Machine Learning Platforms
Gainers and Losers in Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms - KDnuggets
92
Performance Assessment Methods
• In classification problems, the primary source for
accuracy estimation is the confusion matrix
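Confusion matrix:
                              True Class
                              Positive                     Negative
Predicted Class   Positive    True Positive Count (TP)     False Positive Count (FP)
                  Negative    False Negative Count (FN)    True Negative Count (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$
$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$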
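The measures on this slide follow directly from the four cells of the confusion matrix. The sketch below counts the cells from hypothetical actual and predicted labels and then computes accuracy, precision, and recall (recall being the same quantity as the true positive rate). Returning 0.0 when a denominator is zero is a convention chosen here, not something the slide specifies.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Count TP, FP, FN, TN for a two-class problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Safe division; 0.0 for an empty denominator is an assumed convention."""
    return numerator / denominator if denominator else 0.0


# Hypothetical labels for eight test records.
actual = ["Yes", "Yes", "Yes", "No", "No", "No", "No", "No"]
predicted = ["Yes", "Yes", "No", "Yes", "No", "No", "No", "No"]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = ratio(tp + tn, tp + tn + fp + fn)
precision = ratio(tp, tp + fp)
recall = ratio(tp, tp + fn)  # also the true positive rate
print(tp, fp, fn, tn)
print(accuracy, precision, recall)
```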
93
Estimation Methodologies for
Classification
• Simple split (or holdout or test sample estimation)
• Split the data into 2 mutually exclusive sets: training (~70%)
and testing (~30%)
[Figure: preprocessed data is split into training data (2/3) for model development and testing data (1/3); the resulting classifier is scored on the testing data for prediction accuracy assessment.]
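A simple split can be sketched as a shuffle followed by a cut. The slide gives both roughly 70/30 and 2/3 to 1/3 as proportions, so the test fraction is left as a parameter; the function name, the default fraction, and the fixed seed are assumptions made for illustration.

```python
import random


def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and cut them into disjoint training and testing sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records, identified here only by their index.
records = list(range(10))
training, testing = holdout_split(records)
print(len(training), len(testing))
```

The model is built on the training set only; the testing set is held back until scoring so that the accuracy estimate reflects unseen records.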
95
Estimation Methodologies for
Classification – ROC Curve
[Figure: ROC curves labeled A, B, and C; both axes run from 0 to 1.]
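An ROC curve plots the true positive rate against the false positive rate as the decision threshold on a classifier's score is lowered. The sketch below traces those points and computes the area under the curve with the trapezoid rule. The labels and scores are hypothetical, scores are assumed to be distinct, and both classes are assumed to be present.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points, threshold high to low.

    labels: 1 for positive, 0 for negative; higher score means more likely positive.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid rule over consecutive ROC points."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )


# Hypothetical classifier output for six test records.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print(auc)
```

A curve that lies closer to the top-left corner gives a larger area, which is one way to compare classifiers such as A, B, and C in the figure.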
97