
CS822-DataMining-Week1 (1)

The document discusses the importance of data mining in extracting knowledge from the vast amounts of data generated across various fields due to digital transformation. It outlines the data mining process, methods, and challenges, emphasizing the need for pattern discovery to aid decision-making. Additionally, it describes different types of data and data preprocessing techniques, as well as various data mining tasks such as clustering, classification, and regression.

Uploaded by

zainab zahid

1

CS822 Data Mining
Instructor: Dr. Muhammad Tahir

2
Why Data Mining?
• The rapid growth of Data
• Data collection and availability caused by Digital Transformation and
Automation
• Across all fields
• Business: Web, e-commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube
• As a result, we are drowning in data; data mining helps find
knowledge within data.
• “Necessity is the mother of invention”—Data mining—Automated
analysis of massive data sets
3
“We are Drowning in Data...”
• Predict behavior of mankind

4
“We are Drowning in Data...”
• Law enforcement agencies collect unknown amounts of data
from various sources
• Cell phone calls
• Location data
• Web browsing behavior
• Credit card transactions
• Online profiles (Facebook)
• …

• Predict
• Terrorist or not?
• Trustworthiness
5
“...but starving for knowledge!”
• ← Amount of data that is collected
• ← Amount of data that can be looked at by
humans

• We are interested in the patterns, not the data itself!
• Data Mining methods help us to
• discover interesting patterns in large quantities of
data
• make decisions based on the patterns
6
What is Data Mining?
• Data mining has multiple names depending on time and field
• Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.

• Data mining is the process of extracting knowledge from data,
regardless of the reasons for doing so and what this knowledge is used
for.
• In other words, the extraction of interesting (non-trivial,
implicit, previously unknown and potentially useful)
patterns or knowledge from data.
7
What is Data Mining?
• The efficient discovery of previously unknown, valid,
potentially useful, understandable patterns in large
datasets
• The analysis of (often large) observational data sets to
find unsuspected relationships and to summarize the
data in novel ways that are both understandable and
useful to the data owner
• Patterns must be valid, novel, potentially useful,
understandable

8
Data Mining methods
• detect interesting patterns in large quantities of data
• Support human decision making by providing such
patterns
• Predict the outcome of a future observation based on
the patterns

9
Knowledge Discovery in Data:
Process

10
Data Mining Process

11
Knowledge Discovery in Data:
Process
• Example:
• Analysis of purchases in a supermarket

12
Knowledge Discovery in Data:
Challenges

13
Introduction to Data

14
Introduction to Data
• Introduction to Data
• Transactional Data
• Temporal Data
• Spatial & Spatial-Temporal Data

• Data Preprocessing
• Missing Values
• Summarization

15
Introduction to Data

16
Data Come from Everywhere

17
What is Data?
• Collection of records and their
attributes
• An attribute is a characteristic of an
object
• A collection of attributes describes an
object

18
Types of Data
• Record Data: Transactional Data
• Temporal Data: Time Series Data, Sequence Data
• Spatial & Spatial-Temporal Data: Spatial Data, Spatial-Temporal Data
• Graph Data: World Wide Web, Molecular Structures
• Unstructured Data: Twitter Status Message, Review, news article
• Semi-Structured Data: Paper Publications Data, XML format


19
Record Data
• Data that consists of a collection of records, each of
which consists of a fixed set of attributes

20
Transaction Data
• Transaction data is a special type of record data, where
each record (transaction) involves a set of items.
Consider a grocery store.

21
Data Matrix
• If data objects have the same fixed set of numeric
attributes, then the data objects can be thought of as
points in a multi dimensional space, where each
dimension represents a distinct attribute
• Such data set can be represented by an m-by-n matrix,
where there are m rows, one for each object, and n
columns, one for each attribute
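
As a rough illustration of the m-by-n data matrix, the sketch below stores three objects with two numeric attributes each and derives the pairwise Euclidean distances between them. The values and the names `data_matrix`, `euclidean`, and `distance_matrix` are made up for this example and do not come from the slides.

```python
import math

# Hypothetical data matrix: m = 3 objects (rows), n = 2 numeric attributes (columns).
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
]

def euclidean(p, q):
    """Euclidean distance between two points with the same number of attributes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def distance_matrix(rows):
    """Return the m-by-m matrix of pairwise Euclidean distances."""
    return [[euclidean(p, q) for q in rows] for p in rows]

dist = distance_matrix(data_matrix)
for row in dist:
    print(row)
```

Each row of `dist` holds the distances from one object to all objects, so the diagonal is zero and the matrix is symmetric.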

22
Document term matrix
• Each document becomes a “term” vector,
• each term is a component (attribute) of the vector,
• the value of each component is the number of times the
corresponding term occurs in the document.
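
A minimal sketch of how a document-term matrix could be built, assuming whitespace tokenization and two made-up toy documents. The names `documents`, `vocabulary`, and `term_vector` are illustrative only.

```python
from collections import Counter

# Hypothetical toy documents.
documents = [
    "data mining finds patterns in data",
    "mining patterns",
]

# Every distinct term, in sorted order, defines one component of the vector.
vocabulary = sorted({term for doc in documents for term in doc.split()})

def term_vector(doc):
    """Count how often each vocabulary term occurs in the document."""
    counts = Counter(doc.split())
    return [counts[term] for term in vocabulary]

doc_term_matrix = [term_vector(doc) for doc in documents]

print(vocabulary)
for vector in doc_term_matrix:
    print(vector)
```

Real pipelines usually add lowercasing, stop-word filtering, and stemming before counting; those steps are left out here.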

23
Distance Matrix

24
Temporal Data
• Sequence Data

25
26
Temporal Data
• Time Series Data

27
Interval Data
• EL = { (A, 1, 5), (C, 3, 12), (B, 4, 9), (D, 9, 15) }

28
Spatial & Spatial-Temporal Data
• Spatial Data

29
Spatial & Spatial-Temporal Data
• Spatial Data

Average Monthly Temperature of land and ocean

30
Spatial & Spatial-Temporal Data
• Spatial Data

31
Spatial & Spatial-Temporal Data
• Trajectory Data: Set of Hurricanes

32
Spatial & Spatial-Temporal Data
• Trajectory Data: (of 87 users obtained using RFID)

33
User Movement Data
• Trajectory
• Movement trail of a user
• Sampling Points: <latitude, longitude, time>

34
Graph-Based Data
• A graph can sometimes be a convenient and powerful
representation for data.
• We consider two specific cases:
1. the graph captures relationships among data objects
2. the data objects themselves are represented as graphs.

35
Graph Data

36
Semi-structured Data

37
Unstructured Data

38
Issues with Data
• Outliers
• Missing Values
• Inconsistent Values
• Duplicate Data

39
Data Preprocessing
• Aggregation
• Sampling
• Dimensionality reduction
• Feature subset selection
• Feature creation
• Discretization and binarization
• Variable transformation

40
Data can help us solve specific
problems.

41
How should these pictures be placed
into 3 groups?

42
How should these pictures be placed
into groups? How many groups
should there be?

43
Which genes are associated with a disease? How
can expression values be used to predict survival?

44
What items should Amazon display
for me?

45
Is it likely that this stock was traded
based on illegal insider information?

46
Where are the faces in this picture?

47
Is this spam?

48
Will I like 300?

49
What techniques do people apply to
data?
• They apply data mining algorithms and discover useful
knowledge
• So, what are some of the well-known data mining
tasks?
• Clustering,
• Classification,
• Frequent Patterns,
• Association Rules,
• ….

50
What do people do with time series
data?

51
What do people do with trajectory
data?

52
Related Fields

53
Related Fields
• Statistics:
• more theory-based
• more focused on testing hypotheses

• Machine learning
• more heuristic
• focused on improving performance of a learning agent
• also looks at real-time learning and robotics – areas not part of data mining

• Data Mining and Knowledge Discovery
• integrates theory and heuristics
• focuses on the entire process of knowledge discovery, including data cleaning,
learning, and integration and visualization of results

• Distinctions are fuzzy
54


Data Mining Tasks

55
Data Mining Tasks
• Cluster Analysis
• Classification
• Regression
• Association Analysis

56
Cluster Analysis

57
Cluster Analysis: Definition
• Given a set of data points, each having a set of attributes,
and a similarity measure among them, find groups such that
• data points in one group are more similar to one another
• data points in separate groups are less similar to one another
• Similarity Measures
• Euclidean distance if attributes are continuous
• Other task-specific similarity measures
• Goals
1. intra-cluster distances are minimized
2. inter-cluster distances are maximized
• Result
• A descriptive grouping of data points
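
The two goals can be checked numerically for a given grouping. The sketch below assumes two hand-made clusters of 2-D points and compares the mean intra-cluster distance with the mean inter-cluster distance; the data and function names are hypothetical, and averaging pairwise distances is only one of several ways to quantify these goals.

```python
import math
from itertools import combinations

# Hypothetical grouping of 2-D points into two clusters.
clusters = {
    "a": [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)],
    "b": [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0)],
}

def mean_intra_cluster_distance(clusters):
    """Average Euclidean distance over all pairs of points in the same cluster."""
    dists = [math.dist(p, q)
             for points in clusters.values()
             for p, q in combinations(points, 2)]
    return sum(dists) / len(dists)

def mean_inter_cluster_distance(clusters):
    """Average Euclidean distance over all pairs of points in different clusters."""
    dists = [math.dist(p, q)
             for pts_a, pts_b in combinations(clusters.values(), 2)
             for p in pts_a
             for q in pts_b]
    return sum(dists) / len(dists)

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(intra, inter)
```

A good clustering keeps `intra` small relative to `inter`, which is the case for this toy grouping.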
58
Illustration of Clustering
• Euclidean Distance Based Clustering in 3-D space.

[Figure: clusters of points in 3-D space. Intracluster distances are minimized; intercluster distances are maximized.]

59
Cluster Analysis: Application 1
• Application area: Market segmentation
• Goal: Find groups of similar customers
• where a group may be conceived as a marketing target to be
reached with a distinct marketing mix

• Approach:
1. collect information about customers
2. find clusters of similar customers
3. measure the clustering quality by observing buying patterns
after targeting customers with distinct marketing mixes

60
Cluster Analysis: Application 2
• Application area: Document Clustering
• Goal: Find groups of documents that are similar to each
other based on terms appearing in them
• Approach
1. identify frequently occurring terms in each document
2. form a similarity measure based on the frequencies of
different terms
• Application Example: Grouping of articles in Google
News

61
Cluster Analysis: Document
Clustering
• Clustering Points: 3204 Articles of Los Angeles Times.
• Similarity Measure: How many words are common in
these documents (after some word filtering).
Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278


62
Classification

63
Classification: Definition
• Goal: Previously unseen records should be assigned a
class from a given set of classes as accurately as
possible.

• Approach:
• Given a collection of records (training set)
• each record contains a set of attributes
• one attribute is the class attribute (label) that should be
predicted
• Find a model for predicting the class attribute as a
function of the values of other attributes
64
Classification: Example
• Training set:

• Learned model: "Trees are big, green plants without
wheels."
65
Classification: Workflow

66
Classification: Workflow
Attribute types: Home Owner (categorical), Marital Status (categorical), Taxable Income (continuous), Default (class)

Training Set
Tid  Home Owner  Marital Status  Taxable Income  Default
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Test Set
Home Owner  Marital Status  Taxable Income  Default
No          Single          75K             ?
Yes         Married         50K             ?
No          Married         150K            ?
Yes         Divorced        90K             ?
No          Single          40K             ?
No          Married         80K             ?

Workflow: Training Set → Learn Classifier → Model; the Model is then applied to the Test Set.

67
Decision Tree - Example
Attribute types: Home Owner (categorical), Marital Status (categorical), Taxable Income (continuous), Default (class)

Training Data
Tid  Home Owner  Marital Status  Taxable Income  Default
1    Yes         Single          125K            No
2    No          Married         100K            No
3    No          Single          70K             No
4    Yes         Married         120K            No
5    No          Divorced        95K             Yes
6    No          Married         60K             No
7    Yes         Divorced        220K            No
8    No          Single          85K             Yes
9    No          Married         75K             No
10   No          Single          90K             Yes

Model: Decision Tree (splitting attributes: HO, MarSt, TaxInc)
HO (Home Owner)
  Yes: NO
  No: MarSt (Marital Status)
    Married: NO
    Single, Divorced: TaxInc (Taxable Income)
      < 80K: NO
      > 80K: YES

68
Decision Tree - Example
Training Data: the same ten records as in the previous example.

Model: Decision Tree
MarSt (Marital Status)
  Married: NO
  Single, Divorced: HO (Home Owner)
    Yes: NO
    No: TaxInc (Taxable Income)
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
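
The two trees from the decision tree slides can be written as plain functions and checked against the ten training records. This is only a sketch: the function names are invented here, incomes are stored in thousands, and sending an income of exactly 80K to the "No" leaf is an assumption, since the slides label the branches only as "< 80K" and "> 80K".

```python
# Training records copied from the slides:
# (home_owner, marital_status, taxable_income_in_thousands, default)
TRAINING_SET = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

def income_leaf(income_k):
    # Assumption: exactly 80K falls into the "No" leaf (the slides leave this open).
    return "Yes" if income_k > 80 else "No"

def tree_home_owner_first(home_owner, marital_status, income_k):
    """Tree that splits on Home Owner, then Marital Status, then Taxable Income."""
    if home_owner == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return income_leaf(income_k)

def tree_marital_status_first(home_owner, marital_status, income_k):
    """Tree that splits on Marital Status, then Home Owner, then Taxable Income."""
    if marital_status == "Married":
        return "No"
    if home_owner == "Yes":
        return "No"
    return income_leaf(income_k)

def training_accuracy(tree):
    """Fraction of training records whose Default label the tree reproduces."""
    hits = sum(tree(ho, ms, inc) == label for ho, ms, inc, label in TRAINING_SET)
    return hits / len(TRAINING_SET)

print(training_accuracy(tree_home_owner_first))
print(training_accuracy(tree_marital_status_first))
```

Both functions reproduce every training label, which illustrates the point that more than one tree can fit the same data.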

69
Classification: Application 1
• Application area: Fraud Detection
• Goal: Predict fraudulent cases in credit card transactions.
• Approach:
1. Use credit card transactions and information about account-
holders as attributes
• When and where does a customer buy? What do they buy?
• How often do they pay on time? etc.
2. Label past transactions as fraud or fair transactions. This
forms the class attribute
3. Learn a model for the class attribute from the transactions
4. Use this model to detect fraud by observing credit card
transactions on an account
70
Classification: Application 2
• Application area: Direct Marketing
• Goal: Reduce the cost of a mailing campaign by targeting only the set of
consumers who are likely to buy a new product
• Approach:
1. Use data from a campaign introducing a similar product in the past
• we know which customers decided to buy and which decided otherwise
• this {buy, don’t buy} decision forms the class attribute
2. Collect various demographic, lifestyle, and company-interaction related
information about the customers
• age, profession, location, income, marriage status, visits, logins, etc.
3. Use this information to learn a classification model
4. Apply the model to decide which consumers to target
71
Regression

72
Regression
• Predict a value of a continuous variable based on the values of
other variables, assuming a linear or nonlinear model of
dependency
• Examples:
• Predicting sales amounts of a new product based on advertising expenditure
• Predicting the price of a house or car
• Predicting miles per gallon (MPG) of a car as a function of its weight and
horsepower
• Predicting wind velocities as a function of temperature, humidity, air
pressure, etc.
• Difference from classification: The predicted attribute is continuous,
while classification is used to predict nominal attributes (e.g.
yes/no)
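
For the linear case, a one-variable least-squares fit is enough to show the idea. The sketch below assumes made-up advertising and sales figures that lie exactly on a line; `fit_line` and the data are illustrative and not part of the slides.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: advertising expenditure vs. sales amount (same arbitrary units).
advertising = [1.0, 2.0, 3.0, 4.0]
sales = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(advertising, sales)

# Predict the continuous target for an unseen advertising budget.
predicted_sales = slope * 5.0 + intercept
print(slope, intercept, predicted_sales)
```

The output is a number on a continuous scale, in contrast to a classifier, which would return one of a fixed set of class labels.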
73
Association Analysis

74
Association Analysis: Definition
• Given a set of records, each of which contains some
number of items from a given collection
• discover frequent item sets and produce association
rules which will predict occurrence of an item based on
occurrences of other items
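
Frequent item sets and rules are usually scored with support and confidence, two standard measures that the slide does not name. The sketch below computes both by brute force on a handful of invented transactions; the data, the 0.6 support threshold, and the function names are assumptions for illustration.

```python
from itertools import combinations

# Hypothetical market-basket transactions.
transactions = [
    {"diapers", "milk", "beer"},
    {"diapers", "milk", "beer", "bread"},
    {"diapers", "milk"},
    {"milk", "bread"},
    {"beer", "bread"},
]

def support(itemset):
    """Fraction of transactions that contain every item of the item set."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Estimated probability of the consequent given the antecedent."""
    return support(set(antecedent) | set(consequent)) / support(antecedent)

# Brute-force search for frequent pairs (fine for tiny data, too slow for real data).
items = sorted(set().union(*transactions))
frequent_pairs = [pair for pair in combinations(items, 2) if support(pair) >= 0.6]

rule_confidence = confidence({"diapers", "milk"}, {"beer"})
print(frequent_pairs, rule_confidence)
```

On realistic data sets the candidate item sets are pruned instead of enumerated, for example with the Apriori algorithm.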

75
Association Rule Discovery:
Application 1
• Application area: Supermarket shelf management.
• Goal: To identify items that are bought together by sufficiently
many customers
• Approach: Process the point-of-sale data collected with barcode
scanners to find dependencies among items
• A classic rule and its implications:
• if a customer buys diapers and milk, then he is likely to buy beer as well
• so, don’t be surprised if you find six-packs stacked next to diapers!
• promote diapers to boost beer sales
• if selling diapers is discontinued, this will affect beer sales as well
• Application area: Sales Promotion

76
Association Rule Discovery:
Application 2
• Application area: Inventory Management
• Goal: A consumer appliance repair company wants to
anticipate the nature of repairs on its consumer
products and keep the service vehicles equipped with
the right parts to reduce the number of visits to consumer
households
• Approach: Process the data on tools and parts required
in previous repairs at different consumer locations and
discover the co-occurrence patterns

77
Selection and Exploration

78
Selection and Exploration
• Selection
• What data is potentially useful for the task at hand?
• What data is available?
• What do I know about the quality of the data?

• Exploration / Profiling
• Get an initial understanding of the data
• Calculate basic summarization statistics
• Visualize the data
• Identify data problems such as outliers, missing values,
duplicate records
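
A first profiling pass over a single attribute might look like the sketch below. The `ages` column is invented, `None` stands in for a missing value, and the 1.5 × IQR rule is just one common convention for flagging outliers.

```python
import statistics

# Hypothetical attribute column; None marks a missing value.
ages = [23, 25, 25, 27, None, 24, 26, 95, 25]

present = [v for v in ages if v is not None]
missing_count = len(ages) - len(present)

# Basic summarization statistics on the values that are present.
summary = {
    "count": len(present),
    "mean": statistics.mean(present),
    "median": statistics.median(present),
    "stdev": statistics.stdev(present),
}

# Flag outliers with the 1.5 * IQR rule (one convention among several).
q1, _, q3 = statistics.quantiles(present, n=4)
iqr = q3 - q1
outliers = [v for v in present if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print(summary)
print(missing_count, outliers)
```

The gap between mean and median already hints at the extreme value before the outlier check confirms it.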
79
Visualization & Data Mining
• Visualizing the data to facilitate human discovery
• Presenting the discovered results in a visually "nice"
way

80
Summarization
• Describe features of the selected group
• Use natural language and graphics
• Usually in Combination with Deviation detection or
other methods

Average length of stay in this study area rose 45.7
percent, from 4.3 days to 6.2 days, because ...

81
Data Mining Models and Tasks

82
Preprocessing and Transformation

83
Preprocessing and Transformation
• Transform data into a representation that is suitable for the chosen data
mining methods
• scales of attributes (nominal, ordinal, numeric)
• number of dimensions (represent relevant information using fewer attributes)
• amount of data (determines hardware requirements)
• Methods
• discretization and binarization
• feature subset selection / dimensionality reduction
• attribute transformation / text to term vector / embeddings
• aggregation, sampling
• integrate data from multiple sources
• Good data preparation is key to producing valid and reliable models
• Data integration and preparation is estimated to take 70-80% of the time
and effort of a data mining project
84
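
Two of the listed methods, discretization and binarization, can be sketched in a few lines. The income values are the taxable incomes (in thousands) from the classification example; the choice of four equal-width bins and the helper names are assumptions made for this illustration.

```python
def equal_width_bins(values, n_bins):
    """Discretize numeric values into n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins  # assumes the values are not all identical
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def one_hot(values):
    """Binarize a nominal attribute: one 0/1 column per distinct category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Taxable incomes in thousands, taken from the classification example.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
income_bins = equal_width_bins(incomes, 4)

marital_status = ["Single", "Married", "Divorced", "Single"]
status_columns = one_hot(marital_status)

print(income_bins)
print(status_columns)
```

Equal-width binning is sensitive to extreme values such as 220K, which is why equal-frequency binning is often considered as an alternative.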
Data Mining

85
Data Mining
• Input: Preprocessed Data
• Output: Model / Patterns

1. Apply data mining method
2. Evaluate resulting model / patterns
3. Iterate
• experiment with different parameter settings
• experiment with multiple alternative methods
• improve preprocessing and feature generation
• increase amount or quality of training data
86
Deployment

87
Deployment
• Use model in the business context
• Keep iterating in order to maintain and improve model

• CRISP-DM Process Model
[Figure: CRISP-DM (Cross-Industry Standard Process for Data Mining) cycle around the Data Sources: 1 Business Understanding, 2 Data Understanding, 3 Data Preparation, 4 Model Building, 5 Testing and Evaluation, 6 Deployment]


88
How Do Data Scientists Spend Their
Days?

What data scientists spend the most time doing. [taken from
Data Scientist: The Dirtiest Job of the 21st Century | by Jingles (Hong Jing) | TDS Archive | Medium]
89
Data Mining Software

90
RapidMiner
• Powerful data mining suite
• Visual modelling of data mining pipelines
• Commercial tool, offering educational licenses

91
Gartner 2018 Magic Quadrant for Data
Science and Machine Learning Platforms

Gainers and Losers in Gartner 2018 Magic Quadrant for Data Science and Machine Learning Platforms - KDnuggets
92
Performance Assessment Methods
• In classification problems, the primary source for
accuracy estimation is the confusion matrix
Confusion matrix:

                     True Class: Positive        True Class: Negative
Predicted Positive   True Positive Count (TP)    False Positive Count (FP)
Predicted Negative   False Negative Count (FN)   True Negative Count (TN)

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\text{True Positive Rate} = \frac{TP}{TP + FN}$$

$$\text{True Negative Rate} = \frac{TN}{TN + FP}$$

$$\text{Precision} = \frac{TP}{TP + FP} \qquad \text{Recall} = \frac{TP}{TP + FN}$$
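
The measures on this slide follow directly from the four confusion matrix counts. The sketch below uses made-up counts; the function name and the example numbers are assumptions.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's performance measures from confusion matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "true_positive_rate": tp / (tp + fn),
        "true_negative_rate": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # same quantity as the true positive rate
    }

# Hypothetical counts for 100 classified records.
metrics = classification_metrics(tp=40, fp=10, fn=20, tn=30)
for name, value in metrics.items():
    print(name, round(value, 3))
```

The function does not guard against empty denominators, so it assumes at least one actual positive, one actual negative, and one predicted positive.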

93
Estimation Methodologies for
Classification
• Simple split (or holdout or test sample estimation)
• Split the data into 2 mutually exclusive sets: training (~70%)
and testing (30%)

[Figure: Preprocessed Data is split into Training Data (2/3) for model development and Testing Data (1/3); the resulting classifier is assessed (scored) on the testing data to estimate prediction accuracy]

• For ANN, the data is split into three subsets
(training [~60%], validation [~20%], testing [~20%])
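
A simple split can be sketched as a shuffle followed by a cut. The function name, the fixed seed, and the ten dummy records are assumptions; the 30% test fraction follows the slide.

```python
import random

def holdout_split(records, test_fraction=0.3, seed=0):
    """Shuffle the records and cut them into mutually exclusive train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of ten record ids.
records = list(range(10))
train, test = holdout_split(records)
print(len(train), len(test))
```

A three-way split for training, validation, and testing can be obtained by applying the same function a second time to the training part.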
94
Estimation Methodologies for
Classification
• k-Fold Cross Validation (rotation estimation)
• Split the data into k mutually exclusive subsets
• Use each subset as testing while using the rest of the subsets
as training
• Repeat the experiment k times
• Aggregate the test results for a true estimate of prediction
accuracy
• Other estimation methodologies
• Leave-one-out, bootstrapping, jackknifing
• Area under the ROC curve
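
The k-fold procedure can be sketched with index lists only. The helper names are invented, `evaluate` stands for any user-supplied function that trains on one index list and returns an accuracy on the other, and the records are assumed to be in random order already because no shuffling is done here.

```python
def k_fold_splits(n_records, k):
    """Yield (train_indices, test_indices) once for each of the k folds."""
    indices = list(range(n_records))
    # Fold i takes every k-th index starting at i, so the folds are mutually exclusive.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

def cross_validate(n_records, k, evaluate):
    """Average the score returned by evaluate(train, test) over all k folds."""
    scores = [evaluate(train, test) for train, test in k_fold_splits(n_records, k)]
    return sum(scores) / len(scores)

splits = list(k_fold_splits(10, 5))
for train, test in splits:
    print(test, train)
```

Leave-one-out is the special case where k equals the number of records.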

95
Estimation Methodologies for
Classification – ROC Curve
[Figure: ROC curves labeled A, B, and C. x-axis: False Positive Rate (1 - Specificity), 0 to 1; y-axis: True Positive Rate (Sensitivity), 0 to 1]
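
An ROC curve can be traced by lowering the decision threshold over the classifier's scores and recording the false and true positive rates at each step. The sketch below assumes invented scores and labels, distinct score values (ties are not handled), and a trapezoidal approximation for the area under the curve.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points of the ROC curve."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit records from highest to lowest score; each one becomes a predicted positive.
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / negatives, tp / positives))
    return points

def area_under_curve(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical classifier scores and true labels (1 = positive, 0 = negative).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

curve = roc_points(scores, labels)
auc = area_under_curve(curve)
print(curve)
print(auc)
```

A curve closer to the top-left corner gives a larger area, and a classifier that guesses at random has an area of about 0.5.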


96
You are welcome

97
